Hardware Accelerators Boost the Performance of Next-Generation SHARC Processors

September 1, 2008

Summary

The recently announced Analog Devices SHARC® ADSP-2146x processor incorporates hardware accelerators for implementing three widely used signal processing operations: FIR (finite impulse response) filters, IIR (infinite impulse response) filters, and FFTs (fast Fourier transforms). The accelerators offload the core processor and have the potential to more than double the computational throughput of the processor. This paper introduces the accelerators using their application in next-generation audio systems as an example.

Why Hardware Accelerators

The FIR filters, IIR filters, and FFT operations so commonly used in digital signal processing have a regular structure that allows direct implementation in hardware—specifically, hardware accelerators. These accelerators are dedicated fixed-function peripherals designed to perform a single computationally intensive task over and over. They offload the main processor, allowing it to do general-purpose tasks that have little regularity in structure.

Using hardware accelerators offers a cost-effective way to increase the overall computational power of a processor because the system designer gains the flexibility of a general-purpose processor coupled with the computational advantage of dedicated hardware.

Such accelerators are therefore a valuable asset in meeting the demands of ever more complex systems in many application areas. One of those is audio systems, where the number of channels is on the rise. Home theater systems have gone from 5.1 channels to 6.1 and now 7.1. High-end automotive amplifiers routinely drive 12 or more speakers to immerse the listener in sound.

Furthermore, audio source material is now available in high definition (HD) formats whose associated decoders stretch system resources. In addition, HD algorithms provide content at a higher sampling rate. Previously the peak sample rate of content was rarely above 48 kHz. With HD algorithms, the sampling rate is routinely 96 kHz and, in some cases, even as high as 192 kHz.

To better understand how computational demands are increasing, think about the state-of-the-art home theater receivers that incorporate sophisticated room equalization algorithms. These algorithms compensate for variations in driver response and speaker placement. The algorithms first analyze the room using a microphone and real-time transfer function measurements. Measurements at multiple locations are intelligently combined, and then a compensating filter for each speaker is designed.

The more precise room equalization algorithms use FIR filters to correct the response over the entire frequency range. The length of the filter needed is directly proportional to the sampling rate, and longer filters are needed for precise control of low frequencies. Filter lengths of 256 points are common at 48 kHz, while achieving the same frequency resolution at 96 kHz requires a filter length of 512 points. This doubling of sampling rate and filter length leads to a fourfold increase in the amount of computation required.
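The fourfold figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `fir_macs_per_second` is hypothetical, and the frequency resolution is approximated as the sampling rate divided by the filter length, which is a common rule of thumb rather than something stated in this paper.

```python
def fir_macs_per_second(sample_rate_hz, taps):
    """Multiply-accumulates per second for one FIR channel (one MAC per tap per sample)."""
    return sample_rate_hz * taps

# Filter lengths quoted in the text.
base = fir_macs_per_second(48_000, 256)
hd = fir_macs_per_second(96_000, 512)
ratio = hd / base

# Rule-of-thumb frequency resolution (assumption): sample rate / filter length.
resolution_48k = 48_000 / 256
resolution_96k = 96_000 / 512

print(ratio)                            # 4.0
print(resolution_48k, resolution_96k)   # 187.5 187.5
```

Doubling the sampling rate doubles the number of output samples per second, and keeping the same resolution doubles the taps per output sample, so the two factors multiply.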

Accelerators in the SHARC ADSP-2146x

SHARC processors from Analog Devices have a long history of enabling sophisticated signal processing functionality in a multitude of applications. Their feature-rich core and peripherals have made them a logical choice for product developers. Analog Devices’ recently introduced SHARC ADSP-2146x processor reinforces this leadership position with a higher clock speed (450 MHz) and expanded on-chip memory (5 Mb).

In addition, the processor features a set of hardware accelerators for implementing common signal processing operations: FIR filters, IIR filters, and FFTs. These operations form the basis of communication systems, medical devices, consumer products, and industrial measurement and control applications. These accelerators complement the on-board sample rate converter, which was introduced in the SHARC ADSP-2136x processor, and can also be considered to be a hardware accelerator.

Accelerator Architecture

All three accelerators for the SHARC ADSP-2146x have a similar design, which makes the FIR accelerator shown below a good illustration of the hardware accelerator architecture. The FIR accelerator has the following components:

- Set of control registers: configures the operation of the accelerator.
- DMA controller: moves data between main memory and the accelerator’s local memory. It can also be used to configure the control registers.
- Two blocks of local memory: store the coefficients and the state variables (delay memory), which reduces the bandwidth required from main memory.
- Compute unit: implements the arithmetic operations tailored to the accelerator. The FIR compute unit has four parallel MACs.

The operation of the accelerator is automated using chained DMA. The FIR accelerator typically progresses through the following steps:

1. Load the coefficient data for this channel from internal memory to the local accelerator coefficient storage.
2. Load the state variables for this channel from internal memory to the local accelerator state variable storage. This includes the first input sample.
3. Compute the output sample using the four MAC units.
4. Store the result.
5. If there are samples left to process, then fetch the next input sample and write to the state variable storage.
6. Repeat steps 3 to 5 until all the output samples in the channel are computed.
7. Repeat steps 1 to 6 for all input channels.
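The sequence above can be mimicked in software to make the data flow concrete. The following is a minimal behavioral sketch, not the device's programming interface: the function names `mac4` and `run_fir_accelerator` and the dictionary keys `coeffs`, `history`, and `inputs` are hypothetical, the real accelerator moves data by chained DMA rather than by list copies, and the model assumes each channel supplies at least one input sample and a history of N - 1 past samples (newest first).

```python
NUM_MACS = 4  # the FIR compute unit has four parallel MACs


def mac4(coeffs, delay):
    """Dot product split over four accumulators, standing in for four MAC units."""
    acc = [0.0] * NUM_MACS
    for k, (c, d) in enumerate(zip(coeffs, delay)):
        acc[k % NUM_MACS] += c * d
    return sum(acc)


def run_fir_accelerator(channels):
    """Process one block per channel; each channel dict is updated in place."""
    results = []
    for ch in channels:                                # step 7: every channel
        coeffs = list(ch["coeffs"])                    # step 1: load coefficients
        samples = ch["inputs"]
        delay = [samples[0]] + list(ch["history"])     # step 2: state + first sample
        out = []
        for i in range(len(samples)):                  # step 6: loop over the block
            out.append(mac4(coeffs, delay))            # steps 3 and 4: compute, store
            if i + 1 < len(samples):                   # step 5: fetch next sample
                delay = [samples[i + 1]] + delay[:-1]
        ch["history"] = delay[:-1]                     # state carried to the next block
        results.append(out)
    return results


# Impulse through a 3-tap filter returns the coefficients themselves.
channel = {"coeffs": [1, 2, 3], "history": [0, 0], "inputs": [1, 0, 0, 0]}
print(run_fir_accelerator([channel]))  # [[1.0, 2.0, 3.0, 0.0]]
```

Writing the delay line back into `history` at the end of each block is what lets consecutive blocks of the same channel be filtered without discontinuities.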

The ADSP-2146x core has a maximum clock rate of 450 MHz. By using SIMD (single-instruction multiple-data), the core can perform two MAC (multiply-accumulate) operations per clock cycle for a peak rate of 900 MMAC/sec. The accelerator, in comparison, operates at the SHARC peripheral clock rate of 225 MHz. Using its four dedicated MAC units, the FIR accelerator achieves a peak theoretical throughput of 900 MMAC/sec. There is some overhead for configuring the FIR accelerator control registers and moving data in and out of local memory.

The total number of peripheral clock cycles needed to implement a given FIR filter is given by the formula

Cycles = 49 + 4N + B × (N/4 + 2)

where N is the number of filter taps and B is the block size. The cycle count can be broken down further into:

49 = DMA transfer control block initialization.

4N = loading the coefficient and state values (delay line), assuming two cycles per load.

N/4 + 2 = cycles to compute one output sample of a length-N FIR filter.
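The breakdown translates directly into a small cycle estimator. In the sketch below the function name `fir_accel_cycles` is hypothetical, and the per-output cost of N/4 + 2 cycles is an assumption: it follows from four MACs sharing N taps, and it reproduces the 50,056-cycle total this paper quotes for eight 512-tap filters at a block size of 32. The sketch also assumes the tap count is a multiple of four.

```python
def fir_accel_cycles(taps, block_size):
    """Estimated peripheral clock cycles for one FIR filter over one block."""
    setup = 49                    # DMA transfer control block initialization
    load = 4 * taps               # coefficients + state, two cycles per load
    per_output = taps // 4 + 2    # assumed cost of one output sample (four MACs)
    return setup + load + block_size * per_output


per_filter = fir_accel_cycles(512, 32)
print(per_filter)       # 6257
print(8 * per_filter)   # 50056
```

For long filters and large blocks the B × N/4 term dominates, so the fixed setup and load costs are best amortized over as large a block as the latency budget allows.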

Using the Accelerators in Practice

Application software must be designed to get the most out of the hardware accelerators. Keep in mind that the accelerators must be configured to operate in parallel with the main CPU because there is no benefit if the main CPU is idle waiting for the accelerators to finish.

The accelerators are typically part of a larger signal chain running within a real-time environment. Interfacing to the accelerators requires double-buffered input and output data, and the system designer should bear in mind that the accelerators therefore introduce one block of latency.
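A short simulation shows where the block of latency comes from. This is a sequential sketch under stated assumptions: `run_double_buffered` and `process` are hypothetical names, `process` stands in for the accelerator, and in a real system the accelerator would work on one buffer concurrently with the core filling the other rather than being called inline.

```python
def run_double_buffered(blocks, process):
    """Return what is emitted in each block period when using two input buffers."""
    buffers = [None, None]
    emitted = []
    for k, block in enumerate(blocks):
        buffers[k % 2] = list(block)     # core fills one buffer with block k
        other = buffers[(k - 1) % 2]     # accelerator owns the buffer filled last period
        emitted.append(process(other) if other is not None else None)
    return emitted


blocks = [[1, 2], [3, 4], [5, 6]]
print(run_double_buffered(blocks, lambda b: [2 * x for x in b]))
# [None, [2, 4], [6, 8]]
```

The result for block k only appears during period k + 1, which is the one-block delay the system designer has to budget for.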

Consider a home theater system with 7.1 channels of audio at 96 kHz operating at a block size of 32 samples. Assume that room equalization is being applied by eight FIR filters, each 512 points long. If the core CPU were to perform the filtering, it would take at least 96 kHz × 8 × 512 = 393 MMAC/sec, or 44% of the 900 MMAC/sec peak rate of a 450 MHz SHARC processor. This FIR processing represents a significant portion of the overall computation and, fortunately, can be offloaded to the accelerator. The inputs and outputs to the FIR filters are double-buffered, allowing the accelerator to operate in parallel with the rest of the audio signal chain. The double buffering introduces 32 samples of delay in the processing, which is an acceptable 333 μs at 96 kHz.

Applying the previous formula to each of the eight filters, the accelerator requires a total of 50,056 peripheral cycles to complete the operation. At a rate of 225 MHz this is about 222 μs, which is well within the 333 μs block time.
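The timing budget for this example can be reproduced from the figures quoted in the text. The constant and variable names below are illustrative, and the cycle total is taken as given rather than recomputed.

```python
SAMPLE_RATE_HZ = 96_000
BLOCK_SIZE = 32
FILTERS = 8
TAPS = 512
CORE_PEAK_MACS_PER_S = 900e6   # 450 MHz core, two MACs per cycle with SIMD
ACCEL_CLOCK_HZ = 225e6         # SHARC peripheral clock
ACCEL_CYCLES = 50_056          # total quoted for the eight filters

core_load = SAMPLE_RATE_HZ * FILTERS * TAPS / CORE_PEAK_MACS_PER_S
block_time_us = BLOCK_SIZE / SAMPLE_RATE_HZ * 1e6
accel_time_us = ACCEL_CYCLES / ACCEL_CLOCK_HZ * 1e6
accel_busy = accel_time_us / block_time_us

print(f"core load if not offloaded: {core_load:.1%}")      # about 43.7%
print(f"block time: {block_time_us:.1f} us")               # about 333.3 us
print(f"accelerator time: {accel_time_us:.1f} us")         # about 222.5 us
print(f"accelerator busy fraction: {accel_busy:.1%}")      # about 66.7%
```

On these numbers the accelerator is busy for roughly two thirds of each block period, leaving about a third of the period as margin.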

Conclusion

Continuing advances in audio processing technology are placing ever higher demands on audio DSPs. The hardware accelerators within the next-generation SHARC ADSP-2146x processor provide a significant boost in overall processing power. The accelerators offload common signal processing operations (FIR filters, IIR filters, and FFTs) from the core processor, allowing it to focus on other tasks. This cost-effective approach can more than double the computational throughput of the processor. Although this paper focused on audio applications, the core and accelerators are general-purpose and well suited to a variety of signal processing tasks.