TigerSHARC Architectural Backgrounder
TigerSHARC Architectural Backgrounder
The TigerSHARC® Processor is an ultra-high performance static superscalar DSP optimized for multi-processing applications requiring computationally demanding large signal processing tasks. This document describes the key features of the TigerSHARC Processor architecture that combine to offer the highest performance, flexibility, efficiency and scalability available to equipment manufacturers in the marketplace today.
As a static superscalar DSP, the TigerSHARC Processor core can execute simultaneously from one to four 32-bit instructions encoded in a single instruction line. With a few exceptions, an instruction line, whether it contains one, two, three or four 32-bit instructions, executes with a throughput of one cycle in an eight-deep processor pipeline. The TigerSHARC Processor has a set of instruction parallelism rules that programmers must follow when encoding an instruction line. In general, the selection of instruction the DSP can execute in parallel each cycle depends on the instruction line resources each requires and on the source and destination of registers used. The programmer has direct control of the three core components - the IALU, the Computation Blocks, and the Program Sequencer.
In most cases the TigerSHARC Processor has a two-cycle execution pipeline that is fully interlocked, so whenever a computation result is unavailable for another operation dependent on it, stall cycles are automatically inserted. Efficient programming with dependency-free instructions can eliminate most computational and memory transfer dependencies. All of the instruction parallel rules and data dependencies are documented in the TigerSHARC Processor User's Guide.
The TigerSHARC Processor also has the capability of supporting single-instruction, multiple-data SIMD operations through the use of both Computational Blocks in parallel as well as the use of SIMD specific computations. The programmer has the option of directing both Computation Blocks to operate on the same data (broadcast distribution) or different data (merged distribution). In addition, each Computation Block can execute four 16-bit or eight 8-bit SIMD computations in parallel.
In most cases the TigerSHARC Processor has a two-cycle execution pipeline that is fully interlocked, so whenever a computation result is unavailable for another operation dependent on it, stall cycles are automatically inserted. Efficient programming with dependency-free instructions can eliminate most computational and memory transfer dependencies. All of the instruction parallel rules and data dependencies are documented in the TigerSHARC Processor User's Guide.
The TigerSHARC Processor also has the capability of supporting single-instruction, multiple-data SIMD operations through the use of both Computational Blocks in parallel as well as the use of SIMD specific computations. The programmer has the option of directing both Computation Blocks to operate on the same data (broadcast distribution) or different data (merged distribution). In addition, each Computation Block can execute four 16-bit or eight 8-bit SIMD computations in parallel.
As mentioned above, the TigerSHARC Processor has two Computation Blocks that can operate either independently, in parallel or as a SIMD engine. The DSP can issue up to two compute instructions per Computation Block per cycle, instructing the ALU, multiplier or shifter to perform independent, simultaneous operations. The Computation Blocks each contain four computational units, an ALU, a multiplier, a 64-bit shifter, a CLU (ADSP-TS201S only) and a 32-bit register file.
The 32-bit word, multi-ported register files are used for transferring data between the computational units and data buses, and for storing intermediate results. Instructions can access the registers in the register file individually (word-aligned) or in sets of two (dual-aligned) or four (quad-aligned). The ALU performs a standard set of arithmetic operations in both fixed-point and floating-point formats, while also performing logic operations. The multiplier performs both fixed-point and floating-point multiplication as well as fixed-point multiply and accumulates. The 64-bit shifter performs logical and arithmetic shifts, bit and bit-stream manipulation, and field deposit and extraction.
The 32-bit word, multi-ported register files are used for transferring data between the computational units and data buses, and for storing intermediate results. Instructions can access the registers in the register file individually (word-aligned) or in sets of two (dual-aligned) or four (quad-aligned). The ALU performs a standard set of arithmetic operations in both fixed-point and floating-point formats, while also performing logic operations. The multiplier performs both fixed-point and floating-point multiplication as well as fixed-point multiply and accumulates. The 64-bit shifter performs logical and arithmetic shifts, bit and bit-stream manipulation, and field deposit and extraction.
The CLU on the ADSP-TS201S is a 128-bit unit which houses enhanced acceleration instructions specifically targeted at increasing the amount of Complex Multiplies per cycle and improving the Decoding efficiency of the TigerSHARC device. The CLU is not available on the ADSP-TS202S and ADSP-TS203S.
The TigerSHARC Processor has two integer ALUs (IALUs) that provide powerful address generation capabilities and perform many general-purpose integer operations. Each IALU has a multi-ported 31-word register file. As address generators, the IALUs perform immediate or indirect (pre- and post-modify) addressing. They perform modulus and bit-reverse operations with no constraints placed on memory addresses for data buffer placement. Each IALU can specify either a single, dual- or quad- word access from memory.
The TigerSHARC Processor IALUs enable implementation of circular buffers in hardware. Circular buffers facilitate efficient programming of delay lines and other data structures required in digital signal processing, and they are commonly used in digital filters and Fourier transforms. Each IALU provides registers for four circular buffers, so applications can set up a total of eight circular buffers. The IALUs handle address pointer wraparound automatically, reducing overhead, increasing performance, and simplifying implementation. Circular buffers can start and end at any memory location. Because the IALU's computational pipeline is one cycle deep, in most cases integer results are available in the next cycle. Hardware (register dependency check) causes a stall if a result is unavailable in a given cycle.
The TigerSHARC Processor IALUs enable implementation of circular buffers in hardware. Circular buffers facilitate efficient programming of delay lines and other data structures required in digital signal processing, and they are commonly used in digital filters and Fourier transforms. Each IALU provides registers for four circular buffers, so applications can set up a total of eight circular buffers. The IALUs handle address pointer wraparound automatically, reducing overhead, increasing performance, and simplifying implementation. Circular buffers can start and end at any memory location. Because the IALU's computational pipeline is one cycle deep, in most cases integer results are available in the next cycle. Hardware (register dependency check) causes a stall if a result is unavailable in a given cycle.
The TigerSHARC Processor Program Sequencer manages program structure and program flow by supplying addresses to memory for instruction fetches. Contained within the Program Sequencer, the Instruction Alignment Buffer (IAB) caches up to five fetched instruction lines waiting to execute. The Program Sequencer extracts an instruction line from the IAB and distributes it to the appropriate core component for execution. Other Program Sequencer functions include; determining flow according to instructions such as JUMP, CALL, RTI and RTS, decrement the loop counters, handle hardware interrupts and using branch prediction and 128-entry Branch Target Buffer (BTB) to reduce branch delays for efficient execution of conditional and unconditional branch instructions.
The ADSP-TS20xS family has three memory variants. The ADSP-TS201S has 24Mbits of on-chip embedded DRAM memory, divided into six blocks of 4Mbits (128 K words X 32-bits); the ADSP-TS202S has 12Mbits of on-chip embedded DRAM memory, divided into six blocks of 2Mbits (64 K words X 32-bits); the ADSP-TS203S has 4Mbits of on-chip embedded DRAM memory, divided into four blocks of 1Mbit (16 K words X 32-bits). On all variants, each block can store program memory, data memory or both, so programmers can configure the memory to suit their specific needs. The six memory blocks connect to the four 128-bit wide internal buses through a crossbar connection, enabling four memory transfers in the same cycle. The internal bus architecture of the ADSP-TS20xS family provides a total memory bandwidth of 32 Gbytes/second, enabling the core and I/O to access twelve 32-bit data words four 32-bit instructions per cycle.
The TigerSHARC Processor on-chip DMA controller, with fourteen DMA channels, provides zero-overhead data transfers without processor intervention. The DMA controller operates independently and invisibly to the DSP's core, enabling DMA operations to occur while the core continues to execute program instructions.
The DMA controller performs routine functions such as external port block transfers, link port transfers and AutoDMA transfers as well as additional features such as Flyby transfers, DMA chaining and Two-dimensional transfers.
The DMA controller performs routine functions such as external port block transfers, link port transfers and AutoDMA transfers as well as additional features such as Flyby transfers, DMA chaining and Two-dimensional transfers.
The ADSP-TS201S and ADSP-TS202S have four full-duplex link ports each providing four-bit receive and four-bit transmit I/O capability, using Low-Voltage, Differential-Signal (LVDS) technology. With the ability to operate at a double data rate running at 500 MHz, each link can support up to 500 Mbytes per second per direction, for a combined maximum throughput of 4 Gbytes per second.
The ADSP-TS203S has two full-duplex link ports each providing four-bit receive and four-bit transmit I/O capability, using Low-Voltage, Differential-Signal (LVDS) technology. With the ability to operate at a double data rate running at 250 MHz, each link can support up to 500 Mbytes per second per direction, for a combined maximum throughput of 4 Gbytes per second.
Each Link Port has its own triple-buffered quad-word input and double-buffered quad-word output registers. The DSP's core can write directly to a Link Port's transmit register and read from a receive register, or the DMA controller can perform DMA transfers through eight dedicated Link Port DMA channels.
The ADSP-TS203S has two full-duplex link ports each providing four-bit receive and four-bit transmit I/O capability, using Low-Voltage, Differential-Signal (LVDS) technology. With the ability to operate at a double data rate running at 250 MHz, each link can support up to 500 Mbytes per second per direction, for a combined maximum throughput of 4 Gbytes per second.
Each Link Port has its own triple-buffered quad-word input and double-buffered quad-word output registers. The DSP's core can write directly to a Link Port's transmit register and read from a receive register, or the DMA controller can perform DMA transfers through eight dedicated Link Port DMA channels.
The external port on TigerSHARC Processor is 64 bits wide and runs up to 125MHz. Using the external port, up to 8 TigerSHARC Processor's, a host and global memory can be shared without any external logic. This is the second way, in addition to link ports, that TigerSHARC DSP offers support for multiprocessor systems. SDRAM and SBSRAM controllers allow for a glueless interface to these types of memories. The external port also supports a fly by mode which allows a host to access a global shared memory.
Analog Device's TigerSHARC Processor architecture is designed for multi-processor applications that require ultra-high performance signal processing while offering unmatched user flexibility, scalability and on-chip peripheral integration.