For Efficient Signal Processing in Embedded Systems, Take a DSP, not a RISC
by Jerry McGuire
Increasingly, electronic equipment applications involve signal processing. Home theatre, computer graphics, medical imaging, and telecommunications all rely on signal-processing technology. Signal processing requires fast math in complex, but repetitive, algorithms. And many applications require computation in real time: the signal is a continuous function of time, which must be sampled and converted to digital form for numerical processing. The processor must thus execute algorithms that perform discrete computations on the samples as they arrive.
The architecture of a digital signal processor (DSP) is optimized to handle such algorithms. The characteristics of a good signal-processing engine include:
- fast, flexible arithmetic computation units (e.g., multipliers, accumulators, barrel shifters);
- unconstrained data flow to and from the computation units;
- extended precision and dynamic range in the computation units (to avoid overflow and minimize roundoff errors);
- dual address generators (for simultaneous handling of both inputs to a dyadic operation);
- efficient program sequencing (including the ability to deal with loops and interrupts effectively); and
- ease of programming.
A DSP has some of these features in common with a reduced-instruction-set computer (RISC). In addition, both are constructed around certain core instructions, enabling them to operate at very high instruction rates; and both eschew internal microcode. However, they are fundamentally different "animals". The differences between RISCs and DSPs are most pronounced in the areas discussed below: computational units, address generation, memory architecture, interrupt handling, looping, conditional execution, and interfaces.
DSPs belong to two basic classes: fixed point, a (typically) 16-bit architecture based on 16-bit integer data types; and floating point, usually a 32-bit architecture based on a data type that has both a mantissa and an exponent.
Figure 1. SHARC internal architecture.
Computational Units: DSPs all contain parallel hardware multipliers to support single-cycle multiplication, and their multipliers often combine multiplication and accumulation in a single cycle. DSPs have dedicated accumulators with registers significantly wider than the nominal word size to preserve precision: for example, 80 bits in the 32-bit ADSP-2106x SHARC family (Figure 1). Hardware may support recovery from accumulator overflows, as with the ADSP-21xx family. In addition, DSPs all contain full-featured arithmetic-logic units (ALUs), independent of the multiplier.
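To see why the wide accumulator matters, here is a minimal C sketch of a sum-of-products kernel; the function name and data are ours, chosen for illustration. Accumulating 32-bit products into a 64-bit variable emulates in software what a DSP's MAC unit and extended accumulator do in one cycle per tap.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum-of-products with an extended-precision accumulator.
 * Each 16x16 multiply yields a 32-bit product; accumulating many
 * of them into a 64-bit register avoids the overflow a 32-bit
 * accumulator would suffer. A DSP's MAC unit performs one such
 * multiply-accumulate per cycle in hardware. */
int64_t fir_mac(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;                  /* wide accumulator, cf. 80 bits on SHARC */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];  /* single MAC step */
    return acc;
}
```

On a general-purpose processor this loop costs several instructions per tap; on a DSP the multiply, add, and both data fetches collapse into one cycle.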
The ALU may have special features, such as the ability to produce simultaneous sums and differences to accelerate the kernel routine in the fast Fourier transform (FFT), an algorithm for transforming signals between the time domain and the frequency domain. An advanced DSP will contain saturation logic in the computational units to prevent data overflow. It also may offer zero-overhead (i.e., without requiring additional clock cycles) traps to interrupt routines on arithmetic exceptions.
A sophisticated DSP may also contain a single-cycle barrel shifter (i.e., one capable of shifting a word an arbitrary number of bits left or right in one clock cycle), with a priority encoder for data scaling, data compression/expansion or packing/unpacking and bit manipulation. It may also include dedicated hardware to minimize the time required for fast division, square root, and transcendental-function calculation. Computational elements with these specialized features are not found on RISC processors.
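The scaling job the barrel shifter and priority encoder do together can be sketched in C; the function names here are ours. The priority encoder reports how many redundant sign bits a value carries, and the barrel shifter normalizes it in a single shift, as in block-floating-point scaling.

```c
#include <stdint.h>

/* Count redundant sign bits of a 32-bit value -- the quantity a
 * DSP's priority encoder delivers in hardware in one cycle. */
int redundant_sign_bits(int32_t v)
{
    int n = 0;
    while (n < 31 && ((v >> (30 - n)) & 1) == ((v >> 31) & 1))
        n++;
    return n;
}

/* Normalize the value with a single shift (one barrel-shifter
 * cycle on a DSP), recording the exponent adjustment. */
int32_t normalize(int32_t v, int *exp)
{
    int shift = redundant_sign_bits(v);
    *exp = -shift;      /* exponent adjustment for block floating point */
    return v << shift;  /* barrel shift (illustrative; assumes non-negative v) */
}
```

The software loop takes up to 31 iterations; the DSP hardware answers in one cycle, which is what makes per-sample scaling practical.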
Address Generation: An efficient DSP will keep its computational units fed with data from at least two independent data-address generators. Tapped delay lines and coefficient buffers are characteristic of DSPs, yet are mostly unknown in general-purpose computing. An efficient DSP needs circular-buffer hardware to support these buffers: the buffer pointers must be updated every cycle without overhead, and the end-of-buffer comparison must reset each pointer to the start of its buffer with no delay. A RISC processor, on the other hand, requires an additional cycle for each comparison test.
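A minimal C sketch of a tapped delay line makes the overhead visible; the type and function names are ours. The compare-and-reset in the push routine is exactly what a DSP's circular-buffer hardware performs for free each cycle.

```c
#include <stddef.h>

#define BUF_LEN 8  /* illustrative buffer length */

/* Software circular buffer (tapped delay line). */
typedef struct {
    float data[BUF_LEN];
    size_t head;   /* index where the next sample will be written */
} delay_line;

void delay_push(delay_line *d, float sample)
{
    d->data[d->head] = sample;
    d->head++;                 /* post-increment, as an address generator would */
    if (d->head == BUF_LEN)    /* end-of-buffer comparison test ...            */
        d->head = 0;           /* ... and pointer reset: free on a DSP,        */
}                              /* an extra cycle per sample on a RISC          */
```

In an inner filter loop that runs every sample period, that one extra cycle per comparison is the difference the article is describing.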
Memory Architectures: DSPs typically support system memory architectures that differ from those in general-purpose computing systems. DSPs utilize a Harvard architecture, which permits sustained single-cycle access to two words of data from two distinct external memories. Analog Devices SHARC DSPs, for example, feature 2 or 4 Mbits of dual-ported SRAM integrated on-chip. This memory is directly addressed, not a cache, as would be found in RISC processors. To the CPU, this on-chip memory looks like a unique piece of memory, not merely a high-speed replica of memory elsewhere in the system. The reason is that DSPs are typically embedded processors; their on-chip memory is often adequate to contain the complete, repetitive DSP program necessary to the task. Each memory block is dual-ported for single-cycle, independent accesses by the core processor and the I/O processor or DMA controller (Figure 2). The dual-ported memory and separate on-chip buses allow two data transfers from the core and one from I/O, all in a single cycle.
Interrupt capabilities: Because DSPs are intended for operation in real-time systems, efficient, sophisticated, and predictable interrupt handling is critical to a DSP. RISC processors, with their highly pipelined architectures, tend to have slow interrupt response times and limited interrupt capabilities. Context switches should be very fast: advanced DSPs, like the new ADSP-21csp01 and Analog Devices' ADSP-2106x floating-point family, support complete sets of alternative registers, allowing a single-cycle switch of context for interrupt handling. (Register-file windowing differs, in that its purpose is to accelerate parameter passing, not to save an entire context.)
An advanced DSP will support at least four independent external interrupts in addition to internal interrupts. Interrupt latency will be kept to just a few cycles and must be predictable. Interrupts should be nestable and prioritizable. In addition, it should be easy to enable and disable particular interrupts in real time.
Hardware looping: Efficient looping is critical to digital signal processing because signal-processing algorithms are repetitive. A good DSP will support zero-overhead loops with dedicated internal hardware. That is, the chip will monitor loop conditions and terminations to decide, in parallel with all other operations, whether to increment the program counter or branch without cycle-time penalty to the top of the loop. A RISC processor, on the other hand, has to do a test-and-branch at the end of every loop, costing at least one additional cycle on every pass through every loop. Nested loops are also very common in signal-processing algorithms; the DSP looping hardware should support a depth of at least four levels of nested loops. RISC processors have yet to evolve to support these basic signal-processing needs.
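The nested-loop structure in question is easy to see in a C sketch of block convolution (function name ours). Every closing brace below implies a decrement, test, and branch per pass when compiled for a RISC; a DSP's sequencer tracks both loop counters in hardware, so the braces cost nothing.

```c
#include <stdint.h>
#include <stddef.h>

/* Block convolution: the two-deep loop nest typical of signal
 * processing. On a RISC, each loop's end-of-pass test-and-branch
 * costs at least one cycle per iteration; zero-overhead loop
 * hardware on a DSP absorbs both levels. */
void block_fir(const int16_t *x, const int16_t *h,
               int32_t *y, size_t nx, size_t nh)
{
    for (size_t n = 0; n < nx; n++) {             /* outer loop: per output sample */
        int64_t acc = 0;                          /* wide accumulator              */
        for (size_t k = 0; k <= n && k < nh; k++) /* inner loop: per filter tap    */
            acc += (int32_t)h[k] * x[n - k];
        y[n] = (int32_t)acc;
    }
}
```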
Conditional execution: Data-dependent execution is important for signal processing. For this reason, advanced DSPs, like the ADSP-2100 family and the ADSP-2106x SHARC floating-point family, support conditional execution of most of their basic instructions: in a single instruction, the processor tests a condition code and, if true, performs an operation in the same cycle. This can make an enormous difference in computationally intensive algorithms. Intel discovered the cost of lacking this capability with the i860 and added a graphics unit to handle the conditional store operations necessary for high-performance Z-buffering.
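A small C sketch (function name ours) shows the pattern. Written as data-dependent selects rather than branches, the clamps below mirror the test-a-condition-and-operate-in-the-same-cycle style of a DSP's conditional instructions, avoiding pipeline-flushing jumps.

```c
#include <stdint.h>

/* Saturating 16-bit add, branchless style: each clamp is a
 * conditional select, the kind of operation a DSP executes as
 * "if overflow, saturate" within a single instruction cycle. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;             /* exact sum in a wider type */
    s = (s > INT16_MAX) ? INT16_MAX : s;    /* conditional: saturate high */
    s = (s < INT16_MIN) ? INT16_MIN : s;    /* conditional: saturate low  */
    return (int16_t)s;
}
```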
Interfaces: DSPs operate on real-world signals coming from analog-to-digital converters, and they send their results to D/A converters. For this reason, DSPs often contain serial ports for an inexpensive interface to these devices. Advanced DSPs add hardware to make the operation efficient, for example, double-buffering and auto-buffering. Because these input/output signals may come from and go to nonlinear codecs, an advanced DSP may have dedicated hardware for zero-overhead A-law and µ-law companding. In addition, the serial ports may have features to simplify interfacing with T1 and CEPT data transmission lines.
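To show what that companding hardware replaces, here is a minimal C sketch of standard G.711 µ-law compression (function name ours, following the common software formulation). Note that the segment search is itself a priority encode, the same primitive discussed earlier, which is why dedicated hardware handles it with zero overhead as data moves through the serial port.

```c
#include <stdint.h>

/* G.711 mu-law compression: 16-bit linear PCM -> 8-bit codeword. */
uint8_t mulaw_encode(int16_t pcm)
{
    const int BIAS = 0x84, CLIP = 32635;
    int sign = (pcm >> 8) & 0x80;          /* save sign, work on magnitude */
    int mag = sign ? -pcm : pcm;
    if (mag > CLIP) mag = CLIP;            /* clip to keep the bias in range */
    mag += BIAS;
    int exponent = 7;                      /* segment search: locate the    */
    for (int mask = 0x4000; (mag & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;                        /* leading 1 bit (priority encode) */
    int mantissa = (mag >> (exponent + 3)) & 0x0F;
    return (uint8_t)~(sign | (exponent << 4) | mantissa); /* invert per G.711 */
}
```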
Figure 2. The efficient SHARC memory architecture allows I/O bandwidth to keep up with computations.
The SHARC serial ports are designed to maximize throughput with hardware that is flexible, yet tuned to various signal types. Their features include:
- automatic reception and/or transmission of an entire block of data by each serial port;
- independent transmit and receive, each with a data buffer register as well as a shift register; and
- a multi-channel mode for TDM.
Programming considerations: At one time, a significant difference between DSPs and RISCs was their programming models. DSP is inherently performance-driven, so DSPs were programmed mostly in assembly language to get the best performance from the processor. This is generally still true for fixed-point DSPs, but the task is much easier with the ADSP-2100 family's intuitive algebraic assembly language (Figure 3). At no sacrifice in performance, it ameliorates the ease-of-use issue that drives many programmers to favor high-level languages like C.
Figure 3. FFT code example.
On the other hand, floating-point DSPs are more efficiently programmed in a high-level language. Floating-point calculations avoid fractional data types, which do not exist in C. In addition, architectural decisions can affect compiler efficiency: the large, unified address space of the ADSP-2106x SHARC family, for example, makes memory allocation easier for the compiler, and its large, flexible register file improves efficiency.
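The missing fractional type is easy to illustrate; the type alias and function name below are ours. In Q1.15 arithmetic (values in [-1, 1)), every multiply must be followed by a renormalizing shift, which a fixed-point DSP's multiplier performs implicitly in one cycle but a C compiler must emit as explicit code:

```c
#include <stdint.h>

/* Emulated Q1.15 fractional multiply: C has no fractional type,
 * so the programmer multiplies into a wider type and shifts the
 * Q2.30 product back down to Q1.15 by hand. A fixed-point DSP's
 * multiplier does this renormalization implicitly; floating-point
 * parts sidestep the issue entirely, easing C compilation. */
typedef int16_t q15_t;

q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * b;   /* Q2.30 product */
    return (q15_t)(p >> 15);      /* renormalize to Q1.15 (truncating) */
}
```

For example, 0.5 is 0x4000 in Q1.15, and q15_mul(0x4000, 0x4000) yields 0x2000, i.e., 0.25.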
Central to our product strategy is providing the tools and DSP cores that make it possible to efficiently program our fixed- and floating-point DSPs in high-level language. This is the driving force behind the ADSP-21csp, a new family of concurrent signal processors. Nevertheless, though using high-level languages, a DSP programmer must be able to descend in language level (with minimal pain) to improve performance of time-critical routines.
Increasingly, DSP designs are programmed in this sequence: first, a software prototype is written and debugged in a high-level language. Sometimes this prototype delivers adequate performance by itself; more often, however, greater performance is required, so the high-level code is histogrammed (profiled) in simulation to find the sections consuming the most execution time. The critical sections are then hand-coded in assembly language, and the histogramming and hand-coding process is iterated until performance targets are met.
While the differences between DSPs and RISCs are many, the two architectures tend to converge in the area of programming, a convergence driven by time-to-market and the evolving role of DSP in applications. Programmers, skilled at quickly developing working C programs, use them to bring product to market faster. Meanwhile, DSPs take on more system-management functions, such as the user interface or system control, and will need to offer high-level language efficiency to compete with the µCs and RISC processors formerly assigned these control tasks.