Margining and Calibration for Fun and Profit

Abstract

This application note presents an overview of electronic margining and its value in detecting potential system failures before a product ships from the factory. Used together with calibration, margining predicts potential failures and allows adjustments that improve product quality. Margining also can be used to sort products into performance levels, allowing premium products to be sold at premium prices. We discuss the downside of sorting and suggest alternative ways to segregate products.

Introduction

The word "margin" has multiple meanings. A common type of margin, the space around the printed text on a page, can be seen on clay tablets from over 5000 years ago. One of the scariest definitions of the word relates to buying stock on margin (borrowing money from a broker using the stock purchase as collateral). Prior to the 1929 U.S. Stock Market Crash, one could borrow up to 90% of the stock's value. If the stock's value dropped, a "margin call" required one to pay money to retain stock ownership. Failing to meet the margin call meant the stock would be sold, and the investor might lose the entire investment. Today, margin buying is limited to a much smaller percentage.

We, however, will concentrate on margining in the computer and electronics industries. When the first microprocessors were mounted on motherboards to make computers, many users, notably gamers and modders (modifiers), wanted faster speeds. Thus, "speed margining" was born.

Speed margining, also known as "pushing" or "overclocking," is changing your computer's system hardware settings to operate at a speed higher than the manufacturer's rating. This can be done at various points in the system, usually by adjusting the speed of the CPU, memory, video card, or motherboard bus speed.

Often, chips are trimmed and tested by the manufacturer to determine at what speed they fail. They are then rated at a speed one step lower than this. IC manufacturers will try to make their product as fast as possible, because faster hardware sells for more money. Statistically, some ICs in a wafer may be able to run at higher speeds than others. Each chip is tested to see how fast it will run, and the ones that run faster are labeled as "higher speed." The tests are quite rigorous and conservative, since parts must be guaranteed to operate at a minimum speed. Gamers therefore thought it would be possible to push the CPU slightly faster than its rating, while preserving stability in the system.

Overclockers also figured out that some IC manufacturers deliberately underrated chips in order to meet market demand and create differentiation between high-end and low-end products. Occasionally, when manufacturers are short on stock, they package faster chips as slower ones to meet demand. Does overclocking always work? No. However, overclockers try because statistically some succeed.

The Next Step in Margining

As we discussed in the "Why Digital Is Analog" section of application note 4345, "Well Grounded, Digital Is Analog," digital signals are more tolerant of noise and power-supply levels than analog signals are. This is because of the thresholds inherent in digital devices. Analog devices are immediately corrupted, while a digital device generally functions, as long as the signal is higher or lower than the critical threshold levels. In a digital system, failure is sudden (the cliff effect), because deteriorations in the signal are initially rejected (by the thresholds) until they grow serious enough to corrupt data. Therefore, it is necessary to test for performance margin to guarantee that the product will operate over its warranty lifetime and in extreme conditions.

Margining is a long-established technique used in the computer industry to prove the reliability of a digital process, by testing it under conditions more stressful than will be encountered in normal service. The degree of additional stress that can be applied before failure is a measure of the performance margin.

Margining techniques can be applied with great success, and if used correctly, will give a much-needed confidence factor. There are two typical ways to margin a circuit: one is to use pathological data patterns. Depending on the system and coding used, data patterns with long strings of ones or zeros lack clock data and may stress the threshold requirements. The second and more universal method is to change the power-supply level (usually by reducing the voltage).
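
As a rough illustration of why some data patterns are more stressful than others, the sketch below measures the longest run of identical bits in a pattern. Long runs contain no transitions, so they carry no clock information. The helper names (to_bits, longest_run) and the byte values are assumptions chosen for illustration; they are not taken from any particular coding scheme.

```python
from itertools import groupby


def to_bits(data: bytes) -> str:
    """Render bytes as a string of '0' and '1' characters, MSB first."""
    return "".join(f"{byte:08b}" for byte in data)


def longest_run(bits: str) -> int:
    """Return the length of the longest run of identical bits."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)


# Assumed example patterns: one with a transition on every bit,
# and one with long strings of ones followed by long strings of zeros.
alternating = to_bits(bytes([0xAA] * 4))
pathological = to_bits(bytes([0xFF, 0xFF, 0x00, 0x00]))

print(longest_run(alternating))   # 1
print(longest_run(pathological))  # 16
```

A pattern generator for margin testing could use a measure like this to select the patterns that stay longest without a transition.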

Calibrating Margin

During the design phase of a system or PCB, designers must define a test protocol. That procedure will detail how to quantify the stress applied to a device under test (DUT), and how to either objectively sort the DUT into various performance levels or fail the product (Table 1). The protocol should weed out the early infant failures and provide confidence that the DUT will perform over its projected lifetime. A typical protocol would reduce the power-supply voltage in steps, possibly changing other parameters such as clock speed, and monitor some key functional parameters. Limits would be set, for example, to sort the DUT into high- to low-performance units and to remove the failed devices.

Sample Procedure for Placing DUT into Bins A, B, C, D, E, or Fail

Implement a software or firmware procedure that can adjust the power-supply voltage and change the clock speed, while alternately writing one of two data patterns into memory. After each memory write, read the memory to monitor for errors.
In the following binning operations, calibration is performed to precisely set the power-supply voltage. Other calibration procedures would also be completed to compensate for component tolerances and to ensure all other specifications are met.

Table 1. Sample Criteria to Sort DUTs
Bin	DUT Criteria for Bin Placement
A	Meets specifications with power supplies 10% high and clock speed "A"% high.
B	Meets specifications with power supplies 10% high and clock speed "B"% high.
C	Meets specifications with both power supplies and clock speed nominal.
D	Meets specifications with power supplies 10% high and clock speed "D"% low.
E	Meets specifications with power supplies 10% high and clock speed "E"% low.
F (Fail)	Fails to meet specifications.
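
The sample procedure and Table 1 can be sketched in software as follows. This is a minimal simulation under stated assumptions, not a production test program. The clock percentages are placeholder values standing in for the quoted letters in Table 1, the two data patterns (0x55 and 0xAA) are assumed, and SimulatedDut is a hypothetical stand-in for the real hardware interface.

```python
from dataclasses import dataclass

# Hypothetical bin criteria: (bin, supply offset in %, clock offset in %).
# The clock offsets are assumed values replacing the table's placeholders.
CRITERIA = [
    ("A", 10, 20),
    ("B", 10, 10),
    ("C", 0, 0),
    ("D", 10, -10),
    ("E", 10, -20),
]
PATTERNS = (0x55, 0xAA)  # two alternating data patterns (assumed values)


@dataclass
class SimulatedDut:
    """Stand-in for real hardware: corrupts data above max_clock_pct."""
    max_clock_pct: float
    supply_pct: float = 0.0
    clock_pct: float = 0.0

    def set_conditions(self, supply_pct: float, clock_pct: float) -> None:
        self.supply_pct, self.clock_pct = supply_pct, clock_pct

    def write_then_read(self, pattern: int) -> int:
        # Flip one bit when overstressed to mimic a memory error.
        if self.clock_pct <= self.max_clock_pct:
            return pattern
        return pattern ^ 0x01


def passes(dut: SimulatedDut, supply_pct: float, clock_pct: float,
           cycles: int = 8) -> bool:
    """Alternate the two patterns and check every read-back."""
    dut.set_conditions(supply_pct, clock_pct)
    for i in range(cycles):
        pattern = PATTERNS[i % 2]
        if dut.write_then_read(pattern) != pattern:
            return False
    return True


def assign_bin(dut: SimulatedDut) -> str:
    """Return the first (most demanding) bin whose criteria the DUT meets."""
    for name, supply_pct, clock_pct in CRITERIA:
        if passes(dut, supply_pct, clock_pct):
            return name
    return "F"


print(assign_bin(SimulatedDut(max_clock_pct=25)))   # A
print(assign_bin(SimulatedDut(max_clock_pct=5)))    # C
print(assign_bin(SimulatedDut(max_clock_pct=-30)))  # F
```

Testing from the most demanding bin downward means each DUT lands in the highest grade it can meet. In a real fixture, set_conditions and write_then_read would drive the programmable supply, clock source, and memory interface.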

The failed parts can be repaired or discarded, based upon economic factors. Be careful of how the failed boards are discarded; they must be physically destroyed to prevent unscrupulous persons from selling them on the gray market and hurting the reputation of the legitimate manufacturer.

Other Ways to Use Margining and Calibration

Other system specifications can be satisfied using a combination of margining and calibration. The parameters that can be improved include power dissipation and power-supply noise rejection (see Figure 1).

Figure 1. A power-supply decoupling and monitoring system.

Engineering seems to always involve a trade-off between certain advantages and disadvantages. Linear (low dropout) regulators are quiet, but the excess power is turned into heat. Switching regulators are efficient with power, but tend to be noisy. Figure 1 uses calibration to get the best of both worlds. Power regulators typically have tolerances in the 5% to 10% range. Now, envision a system with margining that must minimize power dissipation. The output of both regulators goes through separate lowpass filters or decoupling to minimize noise. Most power-supply regulators are optimized for DC and have just enough frequency response to respond to line and load changes. They operate as feedback loops comparing the output voltage with a reference. The typical frequency response is limited to a few tens to a couple of hundred kilohertz to prevent oscillation.

The MAX11600 family has an input multiplexer, followed by an analog-to-digital converter (ADC), to monitor as many as 12 separate points. To minimize power dissipation, we measure Point A, which accounts for the voltage drop across the switcher decoupling, and set the digital potentiometer (pot) on the switcher accordingly. The Point B measurement lets us compensate for the voltage drop across the LDO decoupling network by adjusting the digital pot on the LDO. By setting the switcher just higher than the required LDO voltage, we can achieve relatively quiet power with minimum power loss. To further reduce noise, the LDO could be replaced with a voltage reference, or a voltage reference could be added to the LDO output to power a critical circuit, such as a low-noise amplifier.
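
The Point A adjustment can be sketched as a simple search: step the switcher's digital pot upward until the ADC reading at Point A just reaches the voltage the LDO needs at its input. Everything numeric here is an assumption for illustration (10 mV per pot step, a 50 mV decoupling drop, a 3.3 V LDO output with 200 mV of headroom), and read_point_a_mv is a simulated reading, not a MAX11600 driver call.

```python
def read_point_a_mv(pot_code: int) -> int:
    """Simulated ADC reading at Point A in millivolts (hypothetical model)."""
    switcher_mv = 3000 + 10 * pot_code  # assumed 10 mV per pot step
    decoupling_drop_mv = 50             # assumed drop across the filter
    return switcher_mv - decoupling_drop_mv


def calibrate_switcher(target_mv: int, read_mv=read_point_a_mv,
                       max_code: int = 255) -> int:
    """Return the lowest pot code whose Point A reading meets target_mv."""
    for code in range(max_code + 1):
        if read_mv(code) >= target_mv:
            return code
    raise ValueError("target voltage is outside the adjustment range")


ldo_output_mv = 3300    # assumed LDO output voltage
ldo_headroom_mv = 200   # assumed dropout headroom for the LDO
code = calibrate_switcher(ldo_output_mv + ldo_headroom_mv)
print(code, read_point_a_mv(code))  # 55 3500
```

Choosing the lowest code that meets the target keeps the switcher output only as high as the LDO requires, which is what minimizes the power dissipated in the LDO. The same search could be repeated at Point B for the LDO's pot.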

The digital pots shown in Figure 1 could also be replaced by a digital-to-analog converter (DAC). For more ideas, please see the following application notes:

Application note 4711, "Digital Calibration Makes Automated Test Easy; Calibration FAQs"
Application note 4704, "Introduction to Electronic Calibration and Methods for Correcting Manufacturing Tolerances in Industrial Equipment Designs"
Application note 4703, "Introduction to Electronic Calibration and Methods for Correcting Manufacturing Tolerances in Medical Equipment Designs"
Application note 4300, "Calculating the Error Budget in Precision Digital-to-Analog Converter (DAC) Applications"
Application note 4003, "Series or Shunt Voltage Reference?"
Application note 226, "Step-Up DC-DC Converter Calibration and Adjustment Using a Digital Potentiometer"

Margining has clear advantages: it allows potential failures to be detected and corrected before the product is shipped. But as with all things engineering, there are dangers. There is a risk that, over the process and component tolerance range, a given batch may not yield the highest grade. Thus, one can be stuck with a large number of low-grade units in stock. Sorting products into performance grades and pricing them accordingly is problematic, but as a testing tool, margining is much safer. The issue is statistics.

Careful statistical design is necessary to protect against the low-yield trap. Examples of tools that can help the design can be found at www.analog.com. They include analog design calculators for the HP® 50g, including a free emulator that runs on a personal computer. The Statistical Process Control Calculator aids in the prediction and analysis of process yield. It calculates defects per million, yield, process shift, standard deviation, mean, and the lower- and upper-specification limits. Micro-Cap 10 is a circuit simulator from Spectrum Software (there is a free evaluation version). This software allows resistor values to be swept and Monte Carlo analysis to be performed to explore the effects of component tolerances.
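
To make the statistics concrete, the sketch below shows two of the calculations such tools perform: yield and defects per million for a normally distributed parameter, and a small Monte Carlo estimate of yield for a resistor divider. It does not reproduce the calculators or the simulator named above. The normal-distribution assumption, the uniform 1% resistor tolerance, the 10 kΩ values, and the error limit are all illustrative assumptions.

```python
import random
from statistics import NormalDist


def process_yield(mean: float, sigma: float, lsl: float, usl: float) -> float:
    """Fraction of units between the lower and upper specification limits,
    assuming the parameter is normally distributed."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(usl) - dist.cdf(lsl)


def defects_per_million(mean: float, sigma: float,
                        lsl: float, usl: float) -> float:
    """Expected number of out-of-specification units per million built."""
    return (1.0 - process_yield(mean, sigma, lsl, usl)) * 1e6


def divider_yield(trials: int = 20000, tol: float = 0.01,
                  max_error: float = 0.0025, seed: int = 1) -> float:
    """Monte Carlo yield of a two-resistor divider with nominal ratio 0.5.
    Each resistor varies uniformly within +/- tol (an assumed distribution)."""
    rng = random.Random(seed)
    good = 0
    for _ in range(trials):
        r1 = 10_000 * (1 + rng.uniform(-tol, tol))
        r2 = 10_000 * (1 + rng.uniform(-tol, tol))
        if abs(r2 / (r1 + r2) - 0.5) <= max_error:
            good += 1
    return good / trials


# A centered process with limits at +/- 3 sigma: roughly 2700 defects per million.
print(round(defects_per_million(0.0, 1.0, -3.0, 3.0)))
print(divider_yield())
```

Shifting the mean or tightening the limits in process_yield shows how quickly the top-grade yield can fall, which is the low-yield trap described above.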

As the number of variables (components) increases, it may become impractical to "guarantee" low-yield protection. Then, another calibration technique could be employed. Let's assume a product has several areas of calibration. We will call those areas A, B, and C (any number of areas can be accommodated). The highest-performance (price) device is fully calibrated in all areas. The midrange product is calibrated in areas A and C. The least-expensive product may be calibrated only in area C, and the adjustment device (digital pot or DAC) is replaced with a fixed resistor in areas A and B. This creates real product differentiation by reducing test and component costs in the lower-precision products.

Conclusion

Margining can effectively detect early failure in systems and provide a way to differentiate product-performance levels. Although margin calls in the stock market may be riskier, it is also important to protect yourself against the low-yield trap of electronic margining.