Volume 40 April 2006
Design Your Own VoIP Solution with a Blackfin® Processor— Add Enhancements Later
By David Katz (david.katz@analog.com)
The VoIP challenge to the embedded-system designer is to choose a processing solution that is cost-effective, easy to deploy, and scalable in performance across market spaces. A "sweet-spot" embedded-solution approach is to design with a platform that can implement a low-channel-count basic VoIP solution, yet retain plenty of capacity for value-added capabilities and services such as video, music, imaging, and system control. The discussion below makes the case that the Blackfin processor family from Analog Devices offers just such an attractive solution.

What Is VoIP?

VoIP uses the Internet Protocol (IP) to send digitized voice traffic over the Internet or private networks. An IP packet consists of a train of digits containing a control header and a data payload. The header provides network navigation information for the packet, and the payload contains the compressed voice data.

While circuit-switched telephony deals with the entire message, VoIP-based data transmission is packet-based, so that chunks of data are packetized (separated into units for transmission), compressed, and sent across the network, and eventually reassembled at the designated receiving end. The key point is that there is no need for a dedicated link between transmitter and receiver.

Packetization is a good match for transporting data (for example, a JPEG file or email) across a network, because the delivery falls into a non-time-critical "best-effort" category. The network efficiently moves data from multiple sources across the same medium. For voice applications, however, "best-effort" is not adequate, because variable-length delays as the packets make their way across the network can degrade the quality of the decoded audio signal at the receiving end. For this reason, VoIP protocols, via QoS (quality-of-service) techniques, focus on managing network bandwidth to prevent delays from degrading voice quality.
Packetizing voice data involves adding header and trailer information to the data blocks. Packetization overhead (additional time and data introduced by this process) must be reduced to minimize added latencies (time delays through the system). Therefore, the process must achieve a balance between minimizing transmission delay and using network bandwidth most efficiently: smaller size allows packets to be sent more often, while larger packets take longer to compose. On the other hand, larger packets amortize the header and trailer information across a bigger chunk of voice data, so they use network bandwidth more efficiently than do smaller packets.

By their nature, networks cause the rate of data transmission to vary quite a bit. This variation, known as jitter, is removed by buffering the packets long enough to ensure that the slowest packets arrive in time to be decoded in the correct sequence. Naturally, a larger jitter buffer contributes to more overall system latency.

As mentioned above, latency represents the time delay through the IP system. A one-way latency is the time from when a word is spoken to when the person on the other end of the call hears it. Round-trip latency is simply the sum of the two one-way latencies. The lower the latency value, the more natural a conversation will sound. For the PSTN phone system in North America, the round-trip latency is less than 150 ms. For VoIP systems, a one-way latency of up to 200 ms is considered acceptable.
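The packet-size trade-off described above can be put into numbers with a short sketch. The header sizes (IPv4 + UDP + RTP with no options, 40 bytes in total) and the 64 kbit/s codec rate are assumptions chosen for illustration, not figures from this article, and link-layer framing is ignored.

```python
# Hypothetical illustration: header sizes and codec rate are assumptions.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no options
CODEC_RATE_BPS = 64_000                # G.711-style 64 kbit/s voice stream


def packet_stats(frame_ms):
    """Return (payload_bytes, efficiency, wire_rate_kbps) for one packet size."""
    payload = CODEC_RATE_BPS * frame_ms // 8000  # bytes of voice per packet
    total = payload + IP_UDP_RTP_HEADER_BYTES
    efficiency = payload / total                 # share of the packet that is voice
    wire_rate_kbps = total * 8 / frame_ms        # bits per ms equals kbit/s
    return payload, efficiency, wire_rate_kbps


for frame_ms in (10, 20, 30):
    payload, eff, rate = packet_stats(frame_ms)
    print(f"{frame_ms:2d} ms: {payload:3d} B payload, "
          f"{eff:.0%} efficient, {rate:.1f} kbit/s on the wire")
```

Under these assumptions, going from 10 ms to 30 ms packets raises efficiency from about 67% to about 86%, at the cost of 20 ms more packetization delay per packet.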
The largest contributors to latency in a VoIP system are the network and the gateways at either end of the call. The voice codec (coder-decoder) adds some latency, but this is usually small by comparison.

When the delay is large in a voice network application, the main challenges are to cancel echoes and eliminate overlap. Echo cancellation directly affects perceived quality; it becomes important when the round-trip delay exceeds 50 ms. Voice overlap becomes a concern when the one-way latency is more than 200 ms.

Because most of the time elapsed during a voice conversation is "dead time," during which no speaker is talking, codecs take advantage of this silence by not transmitting any data during these intervals. Such "silence compression" techniques detect voice activity and stop transmitting data when there is no voice activity, instead generating "comfort" noise to ensure that the line does not appear dead when no one is talking.

In a standard PSTN telephone system, echoes that degrade perceived quality can happen for a variety of reasons. The two most common causes are impedance mismatches in the circuit-switched network ("line echo") and acoustic coupling between the microphone and speaker in a telephone ("acoustic echo"). Line echoes are common when there is a two-wire-to-four-wire conversion in the network (e.g., where analog signaling is converted into a T1 system).

Because VoIP systems can link to the PSTN, they must be able to deal with line echo, and IP phones can also fall victim to acoustic echo. Echo cancellers can be optimized to operate on line echo, acoustic echo, or both. The effectiveness of the cancellation depends directly on the quality of the algorithm used.

An important parameter for an echo canceller is the length of the packet on which it operates. Put simply, the echo canceller keeps a copy of the signal that was transmitted. For a given time after the signal is sent, it seeks to correlate and subtract the transmitted signal from the returning reflected signal, which is, of course, delayed and diminished in amplitude.
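The correlate-and-subtract idea can be sketched with a normalized LMS (NLMS) adaptive filter. NLMS is one common technique and is used here as an assumption; the article does not say which algorithm a given product uses. The 8 kHz sample rate, the step size, and the synthetic echo path (5-sample delay, 0.4 gain) are also illustrative choices.

```python
import random


def taps_for_window(window_ms, sample_rate_hz=8000):
    """Number of filter taps needed to span an echo window (8 kHz assumed)."""
    return window_ms * sample_rate_hz // 1000


def nlms_echo_cancel(far_end, mic, num_taps, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from the mic signal."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # recent transmitted samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                       # what remains after subtraction
        norm = sum(h * h for h in history) + eps    # input power for normalization
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def power(samples):
    return sum(s * s for s in samples) / len(samples)


# Synthetic check: the echo is the far-end signal, delayed and attenuated.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo = [0.0] * 5 + [0.4 * s for s in far_end[:-5]]
residual = nlms_echo_cancel(far_end, echo, num_taps=16)

print(taps_for_window(32))  # taps needed to cover a 32 ms window at 8 kHz
print(power(echo[-500:]), power(residual[-500:]))
```

The filter length sets how long an echo tail the canceller can cover, which is why a longer window costs proportionally more computation per sample.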
To achieve effective cancellation, it usually suffices to use a standard correlation window size (e.g., 32 ms, 64 ms, or 128 ms), but larger sizes may be necessary.

Emerging and Current VoIP-Based Applications

VoIP users tend to think of their connection as being "free," since they can call anywhere in the world, as often as they want, for just pennies per minute. Although they are also paying a monthly fee to their Internet service provider, it can be amortized over both data and voice services.

Besides the low cost relative to the circuit-switched domain, many new features of IP services become available. For instance, incoming phone calls on the PSTN can be automatically rerouted to a user's VoIP phone, as long as it's connected to a network node. This arrangement has clear advantages over a global-enabled cellphone, since there are no roaming charges involved. From the VoIP standpoint, the end user's location is irrelevant; it is simply seen as just another network-connection point. This is especially useful where wireless local-area networks (LANs) are available; IEEE-Standard-802.11-enabled VoIP handsets allow conversations at worldwide Wi-Fi hotspots without the need to worry about mismatched communications infrastructure and transmission standards.

Everything discussed so far in relation to voice-over-IP extends to other forms of data-based communication as well. After all, once data is digitized and packetized, the nature of the content doesn't much matter, as long as it is appropriately encoded and decoded with adequate bandwidth. Because of this, the VoIP infrastructure facilitates an entirely new set of networked real-time applications, such as:
A CLOSER LOOK AT A VoIP SYSTEM
Figure 1. (a) Simplified representation of possible IP telephony network connections.

The signaling process involves creating, maintaining, and terminating connections between nodes. In order to reduce network bandwidth requirements, audio and video are encoded before transmission and decoded during reception. This compression and conversion process is governed by various codec standards for both audio and video streams. The compressed packets move through the network governed by one or more transport protocols. A switching gateway ensures that the packet set is interoperable at the destination with another IP-based system or a PSTN system. At its final destination, the packet set is decoded and converted back to an audio/video signal, at which point it is played through the receiver's speakers and/or display unit.

The OSI (Open Systems Interconnection) seven-layer model (Figure 2) specifies a framework for networking. If there are two parties to a communication session, data generated by each starts at the top, undergoing any required configuration and processing through the layers, and is finally delivered to the physical layer for transmission across the medium. At the destination, processing occurs in the reverse direction, until the packets are finally reassembled and the data is provided to the second user.
Session Control: H.323 vs. SIP

International Telecommunication Union (ITU) H.323

SIP (Session Initiation Protocol)

SIP is used with SDP (Session Description Protocol) for user discovery; it provides feature negotiation and call management. SDP is essentially a format for describing initialization parameters for streaming media during session announcement and invitation. The SIP/SDP pair is somewhat analogous to the H.225/H.245 protocol set in the H.323 standard.

SIP can be used in a system with only two endpoints and no server infrastructure. However, in a public network, special proxy and registrar servers are utilized for establishing connections. In such a setup, each client registers itself with a server, in order to allow callers to find it from anywhere on the Internet.

TRANSPORT LAYER PROTOCOLS

UDP (User Datagram Protocol)

TCP (Transmission Control Protocol)

TCP creates smaller packets that can be transmitted over the Internet and received by a TCP layer at the other end of the call, such that the packets are "reassembled" back into the original message. The IP layer interprets the address field of each packet so that it arrives at the correct destination.

Unlike UDP, TCP does guarantee complete receipt of packets at the receiving end. However, it does this by allowing packet retransmission, which adds latencies that are not helpful for real-time data. For voice, a late packet due to retransmission is as bad as a lost packet. Because of this characteristic, TCP is usually not considered an appropriate transport for real-time streaming media transmission. Figure 2 shows how the TCP/IP Internet model, and its associated protocols, compares with and utilizes various layers of the OSI model.
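The point that a late packet is as bad as a lost one can be illustrated with a toy playout model. This is a deliberately simplified sketch, not a model of real TCP behavior: the 20 ms packet interval, 40 ms network delay, 100 ms playout delay, and 200 ms retransmission timeout are all assumed values, and the function name is hypothetical.

```python
FRAME_MS = 20        # one voice packet every 20 ms (assumed)
NETWORK_MS = 40      # one-way network delay (assumed)
PLAYOUT_MS = 100     # delay before the receiver plays each packet (assumed)
RETRANSMIT_MS = 200  # time until a lost packet is resent (assumed)


def unplayable(num_packets, lost, reliable):
    """Return indices of packets that miss their playout deadline."""
    missed = []
    blocked_until = 0  # reliable, in-order delivery stalls later packets
    for i in range(num_packets):
        sent = i * FRAME_MS
        deadline = sent + PLAYOUT_MS
        if i in lost and not reliable:
            missed.append(i)  # datagram dropped and never resent
            continue
        arrival = sent + NETWORK_MS
        if i in lost:
            arrival += RETRANSMIT_MS  # the resent copy arrives much later
        if reliable:
            arrival = max(arrival, blocked_until)  # wait for earlier packets
            blocked_until = arrival
        if arrival > deadline:
            missed.append(i)
    return missed


print(unplayable(15, lost={3}, reliable=False))  # [3]
print(unplayable(15, lost={3}, reliable=True))   # [3, 4, 5, 6, 7, 8, 9]
```

In this model, one lost packet costs the unreliable stream a single 20 ms gap, while the reliable stream misses seven consecutive deadlines because every later packet waits behind the retransmission.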
Figure 2. Open Systems Interconnection and TCP/IP models.

Media Transport

RTP (Real-Time Transport Protocol)
Figure 3. Header structure and payload of an RTP frame.

In order to maintain a given QoS level, RTP utilizes timestamps, sequence numbering, and delivery confirmation for each packet sent. It also supports a number of error-correction schemes for increased robustness, as well as some basic security options for encrypting packets. Figure 4 compares performance and reliability of UDP, RTP, and TCP.
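To make the sequence-number and timestamp fields concrete, here is a minimal sketch that packs and parses the 12-byte fixed RTP header as laid out in RFC 3550. It assumes no padding, no header extension, and no CSRC entries; the function names and the SSRC value are illustrative.

```python
import struct

RTP_VERSION = 2


def pack_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Build the 12-byte fixed RTP header (no padding, extension, or CSRCs)."""
    byte0 = RTP_VERSION << 6                           # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + payload type
    return struct.pack("!BBHII", byte0, byte1,
                       sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet):
    """Parse the fixed header and return its fields plus the payload."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }


# Payload type 0 is PCMU (G.711 mu-law); 160 samples is 20 ms at 8 kHz.
packet = pack_rtp_header(0, sequence=1, timestamp=160, ssrc=0x1234ABCD) + bytes(160)
fields = unpack_rtp_header(packet)
print(len(packet), fields["version"], fields["sequence"], fields["timestamp"])
```

A receiver can use the sequence number to detect loss and reordering, and the timestamp to schedule playout regardless of when each packet actually arrived.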
Figure 4. Performance vs. reliability.

RTCP (RTP Control Protocol)

MEDIA CODECS

A number of factors help determine how desirable a codec is, including how efficiently it makes use of available system bandwidth, how it handles packet loss, and what costs are associated with it, including intellectual-property royalties.

BLACKFIN VoIP COLLATERAL

As an example, the ADSP-BF537 Blackfin processor family provides the necessary degree of integration and performance, with low power consumption, for VoIP deployment. It features multiple integrated serial ports (for glueless connection to audio analog-to-digital (A/D) and digital-to-analog (D/A) converters), an external memory controller, a parallel peripheral interface (PPI) for LCD or video encoder/decoder connectivity, and a 10/100BaseT Ethernet MAC. If necessary, a second Ethernet MAC can be accommodated via the external memory interface.

A complete communication channel (including voice and networking stack) uses less than 75 MIPS of the processing bandwidth. With ADSP-BF537 performance at up to 600 MHz, there is plenty of available processor "horsepower" to spread across a VoIP product portfolio, as features such as multimedia compression or decompression become necessary. In contrast, competing dedicated VoIP choices are typically performance-limited and offer little or no ability to add features or differentiation.

For VoIP applications, Blackfin-based designs target high-quality, low-channel-count VoIP solutions, with processing headroom to accommodate added features such as music, video, and image transport, as well as overall system control. Here is a sampling of available VoIP offerings, ranging from open-source solutions to high-volume OEM reference designs:

Blackfin/Linphone

The main components used in the Blackfin Linphone reference design are:

Linux TCP/IP networking stack: includes necessary transport and control protocols, such as TCP and UDP.
Linphone: the main VoIP application, which includes Blackfin-based G.711 and GSM codec implementations. It comprises both a graphical user interface (GUI) for desktop PCs and a simple command-line application for nongraphical embedded systems.
oRTP: an implementation of an RTP stack developed for Linphone and released under the LGPL license.
oSIP: a thread-safe implementation of the SIP protocol released under the LGPL license.
Speex: the open-source reference implementation of the Speex codec. Blackfin-specific optimizations to the fixed-point Speex implementation have been contributed back to the mainline code branch.

Unicoi Systems Blackfin-Based Fusion Voice Gateway
Figure 5. Blackfin-based Fusion Voice Gateway from Unicoi Systems.

The Fusion Voice Gateway features robust functionality, including G.168 echo cancellation and multiple G.7xx voice codecs. The Fusion reference design also includes full-featured telephony and router functionality by combining an Internet router, a 4-port Ethernet switch and VoIP gateway functionality.

Unicoi Systems Blackfin-Based Fusion IP Phone

The Fusion IP Phone Reference Design reduces BOM cost as well as the time and complexity often associated with developing an IP phone. Designed around the ADSP-BF536, the reference design software delivers the critical processing (e.g., real-time operating system, call manager, voice algorithms, acoustic echo cancellation for full-duplex speakerphone), communication protocols (TCP/IPv4/v6, SIP, RTP, etc.), and peripheral functions (LCD and keypad controllers, etc.) required to build a basic or advanced IP phone.

Blackfin BRAVO VoIP Reference Designs

For audio, the designs support multiple G.7xx audio codecs, G.168-compliant network echo cancellation, and acoustic echo cancellation for enhanced audio clarity. Optionally, RF transceivers can be included in the design to provide wireless audio capability. The designs support both H.323- and SIP-compliant software stacks.

On the video front, the BRAVO Broadband Audio/Video Communications reference design (Figure 6) provides up to 30 frames per second of common intermediate format (CIF) color video, including support for ITU-standard H.263 and H.264 video codecs, picture-in-picture, high-resolution graphics with overlay, alpha and chroma keying, and antiflicker filtering.
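A quick calculation shows why a video codec is needed for CIF video at 30 frames per second. CIF resolution is 352 x 288 pixels; the 8-bit YUV 4:2:0 sampling (1.5 bytes per pixel) and the 384 kbit/s target channel below are assumptions for illustration, not figures from the reference design.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288  # CIF luma resolution
BYTES_PER_PIXEL = 1.5             # assumed 8-bit YUV 4:2:0 sampling
FRAMES_PER_SECOND = 30
TARGET_BPS = 384_000              # hypothetical channel rate, for comparison only


def raw_video_bps(width, height, fps, bytes_per_pixel=BYTES_PER_PIXEL):
    """Uncompressed video bit rate in bits per second."""
    return int(width * height * bytes_per_pixel * fps * 8)


raw = raw_video_bps(CIF_WIDTH, CIF_HEIGHT, FRAMES_PER_SECOND)
print(f"raw CIF video: {raw / 1e6:.1f} Mbit/s")
print(f"compression needed for the assumed channel: about {raw / TARGET_BPS:.0f}:1")
```

Under these assumptions the uncompressed stream is roughly 36.5 Mbit/s, so the codec has to deliver compression on the order of 100:1 for a channel of that size.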
Figure 6. Blackfin BRAVO Broadband Audio/Video Communications reference design, functional diagram.
CONCLUSION

*To be exact, the task of session control and initiation lies in the domain of H.225.0 and H.245, which are part of the H.323 umbrella protocol.
Copyright 1995- Analog Devices, Inc. All rights reserved.