Leveraging the On-Chip FIR and IIR Hardware Accelerators on a Digital Signal Processor

Abstract

Finite impulse response (FIR) and infinite impulse response (IIR) filters are the most frequently used digital signal processing algorithms—especially for audio processing applications. Thus, a significant portion of a processor core’s time is consumed for FIR and IIR filtering in a typical audio system. The on-chip FIR and IIR hardware accelerators, also referred to as FIRA and IIRA, respectively, on a digital signal processor can be used to offload the FIR and IIR processing tasks, thus freeing up the core for other processing. In this article, we will discuss how to make use of these accelerators in practice with the help of different usage models illustrated with tested real-time examples.

Introduction

Figure 1. FIRA and IIRA system block diagram.

Figure 1 shows a simplified block diagram of FIRA and IIRA and how they interact with the rest of the processor system and resources.

Both the FIRA and IIRA blocks mainly consist of a compute engine—multiply and accumulate (MAC) units—along with a small local data and coefficient RAM.
To start the FIRA/IIRA processing, the core initializes a chain of DMA transfer control blocks (TCBs) in processor memory with channel-specific information. The core then writes the FIRA/IIRA chain pointer register with the start address of this TCB chain and then configures the FIRA/IIRA control register to start the accelerator processing. Once processing of all the channels is complete, an interrupt is sent to the core so that it can use the processed output for further operations.
Theoretically, the best approach is to offload all the FIR and/or IIR tasks from the core to the accelerator(s) and allow the core to do something else in parallel. But in practice, this may not always be feasible, particularly when the core needs to use the output from the accelerator(s) for further processing and has no other independent tasks to finish in parallel. In such cases, we need to choose the appropriate accelerator usage model to achieve the best results.

In this article, we’ll discuss various models to use these accelerators optimally for different application scenarios.

Using FIRA and IIRA in Real Time

Figure 2. Typical real-time audio data flow.

Figure 2 shows a typical real-time PCM audio data flow diagram. One frame of digitized PCM audio data is received over a synchronous serial port (SPORT) and sent to memory via direct memory access (DMA). While reception of frame N+1 is going on, frame N is processed by the core and/or the accelerator(s), and the output of the previously processed frame (N-1) is sent out via SPORT to the DAC for digital-to-analog conversion.

Accelerator Usage Models

As discussed earlier, depending upon the application, the accelerators may need to be used in different ways to offload the maximum FIR and/or IIR processing tasks and to save the most possible core cycles for other operations. At a high level, the accelerator usage models can be divided into three categories: direct replacement, split task, and data pipelining.

Direct Replacement

The core FIR and/or IIR processing is directly replaced by the accelerator(s) and the core simply waits for the accelerator(s) to finish the job.
This model is effective only when the accelerator can process faster than the core; that is, using the FIRA block.

Split Task

The FIR and/or IIR processing tasks are divided between the core and the accelerator(s).
This model is especially useful when multiple channels are available to be processed in parallel.
Based on a rough timing estimation, the total number of channels can be divided between the core and the accelerator(s) in such a way that both finish at approximately the same time.
As shown in Figure 3, this usage model results in more core-cycle savings than the direct replacement model.

Data Pipelining

The data flow between the core and accelerator(s) can be pipelined in such a way that both can work in parallel on different data frames.
As shown in Figure 3, the core processes the N^th frame and then initiates the accelerator’s processing of this frame. The core then continues in parallel to further process the N-1^th frame output produced by the accelerator(s) in the previous iteration. This sequence allows the complete offloading of the FIR and/or IIR processing task to the accelerator(s) at the cost of additional output latency.
The pipeline stages and, consequently, the output latency can increase depending upon the number of such FIR and/or IIR processing stages in the complete processing chain.

Figure 3 illustrates how the audio data frames flow between three stages—DMA IN, core/accelerator processing, and DMA OUT—for various accelerator usage models. It also shows how free core cycles increase compared to the core only model by fully or partially offloading the FIR/IIR processing to the accelerator across different accelerator usage models.

Figure 3. Accelerator usage models comparison.

FIRA and IIRA on SHARC Processors

The following Analog Devices SHARC^® processor families support on-chip FIRA and IIRA (oldest to the newest).

Across processor families:

Compute speed varies.
The basic programming model remains the same except for the auto configuration mode (ACM) on ADSP-2156x processors.
FIRA has four MAC units while IIRA has a single MAC unit.

FIRA/IIRA Improvements on ADSP-2156x Processors

ADSP-2156x is the latest addition to the SHARC processor family. It is the first 1 GHz single-core SHARC processor with FIRA and IIRA also capable of running at 1 GHz. FIRA and IIRA on ADSP-2156x processors offer various improvements over its predecessors, the ADSP-SC58x/ADSP-SC57x processors.

Performance Improvements

Compute speed increased by eight times (SCLK-125 MHz to CCLK-1 GHz).
Lesser data and MMR access latency between the core and the accelerators is possible due to closer integration of the core and the accelerators with the help of a dedicated core fabric.

Functional Improvements

Support for ACM was added to minimize core intervention required to handle accelerator processing. This mode comes with the following new, main features:

Allows halting of the accelerator for dynamic task queuing.
No channel number limitation.
Trigger generation (master) and trigger wait (slave) support.
Selective interrupt generation for each channel.

Experimental Results

In this section, we’ll discuss the results of two real-time multichannel FIR/IIR use cases implemented with the help of different accelerator usage models on an ADSP-2156x evaluation board.

Use Case 1

Figure 4 shows the block diagram of use case 1. The sample rate is 48 kHz, the block size is 256 samples, and the ratio of core to accelerator channels used in the split task model is 5:7.

Table 1 shows the measured core and FIRA MIPS numbers along with the resultant core MIPS savings compared to the core only model. The table also shows additional output latency added by the corresponding usage model. As we can see, with accelerator usage, up to 335 core MIPS could be saved with a data pipelining usage model, at the cost of 1 block (5.33 ms) of output latency. Direct replacement and split task usage models also result in 98 MIPS and 189 MIPS savings, respectively, without any additional output latency.

Table 1. Core and FIR/IIRA MIPS Summary for Use Case 1
Usage Model	Core MIPS	FIRA MIPS	IIRA MIPS	Core MIPS Saving	Usage Model Latency (ms)
Core Only	337				0
Direct Replacement	239	162	75	98	0
Split Task	148	96	44	189	0
Data Pipelining	2	161	75	335	5.33 (1 frame)

Use Case 2

Figure 5 shows the block diagram of use case 2. The sample rate is 48 kHz, the block size is 128 samples, and the ratio of core to accelerator channels used in the split task model is 1:1.

Like Table 1, Table 2 shows the results for this use case. As we can see, with accelerator usage, up to 490 core MIPS could be saved with a data pipelining usage model at the cost of 1 block (2.67 ms) of output latency. A split task usage model results in 234 core MIPS savings without any additional output latency. Note that unlike in use case 1, frequency-domain (fast convolution) processing is used for the core instead of time-domain processing. This is the reason why the core MIPS taken to process one channel is less than the FIRA MIPS taken, which results in a negative core MIPS savings for the direct replacement usage model.

Table 2. Core and FIR/IIRA MIPS Summary for Use Case 2
Usage Model	Core MIPS	FIRA MIPS	Core MIPS Saving	Usage Model Latency (ms)
Core Only	493			0
Direct Replacement	515	511	–22	0
Split Task	259	257	234	0
Data Pipelining	3	511	490	2.67 (1 frame)

Conclusion

In this article, we have seen how significant core MIPS can be offloaded to FIRA and IIRA accelerators on ADSP-2156x processors, utilizing different accelerator usage models to achieve desired MIPS and processing profiles.