Fault Management Architecture

Michael Jones

I recently had some enquiries about using PMBus communication to manage faults versus built-in fault management. The question was twofold: what is the impact of using PMBus to make fault-off decisions, and what can be done to build system-wide fault logs?

Before I dig into these two uses of PMBus, let’s consider what is in the PMBus specification and the intent of the PMBus creators. The PMBus specification defines warnings, faults, and the responses of individual devices:

[Figure: PMBus warnings, faults, and individual device responses]

There are two methods for communicating fault information: the Alert Response Address (ARA) and the Host Notify Protocol (HNP). ARA begins with the assertion of ALERTB to interrupt the board controller, which responds with PMBus address queries to build a list of all devices asserting ALERTB. HNP is initiated by the device, which becomes a PMBus master and sends its STATUS_WORD directly to the board controller. In practice, the device responds to the fault first and notifies the board controller afterwards. This protects the device and the load by ensuring the fastest possible response to a fault, namely stopping power transfer.
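The ARA flow can be sketched in C. This is a minimal, hypothetical sketch, not the reference design’s driver: the bus primitives and the simulated device table stand in for a real I2C driver and real PMBus devices; only the ARA address (0x0C) and the STATUS_WORD command code (0x79) come from the SMBus/PMBus specifications.

```c
#include <stdint.h>
#include <stdbool.h>

#define ARA_ADDR        0x0C  /* SMBus-defined Alert Response Address */
#define CMD_STATUS_WORD 0x79  /* PMBus STATUS_WORD command code       */

/* --- Tiny simulated bus so the sketch runs stand-alone. --------------- */
/* In the real design these would be the platform's I2C driver calls.     */
static uint8_t faulting[4] = { 0x30, 0x32 };  /* 7-bit addrs asserting ALERTB */
static int     nfault      = 2;

static bool alert_asserted(void) { return nfault > 0; }

static bool i2c_receive_byte(uint8_t addr, uint8_t *data)
{
    if (addr != ARA_ADDR || nfault == 0) return false;
    *data = (uint8_t)(faulting[0] << 1);      /* ARA returns addr in bits 7:1 */
    return true;
}

static bool i2c_read_word(uint8_t addr, uint8_t cmd, uint16_t *data)
{
    if (cmd != CMD_STATUS_WORD) return false;
    *data = 0x0010;                           /* pretend an IOUT fault bit is set */
    /* A device releases ALERTB once its status has been queried. */
    for (int i = 0; i < nfault; i++)
        if (faulting[i] == addr) { faulting[i] = faulting[--nfault]; break; }
    return true;
}

/* --- The actual ARA loop: query until no device asserts ALERTB, -------- */
/* collecting each responder's address and STATUS_WORD. Returns the count. */
int service_alert(uint8_t *addrs, uint16_t *status, int max)
{
    int n = 0;
    while (alert_asserted() && n < max) {
        uint8_t resp;
        if (!i2c_receive_byte(ARA_ADDR, &resp)) break;  /* nobody answered */
        uint8_t dev = resp >> 1;
        if (!i2c_read_word(dev, CMD_STATUS_WORD, &status[n])) break;
        addrs[n++] = dev;
    }
    return n;
}
```

The loop keeps issuing ARAs because several devices can assert ALERTB at once; each query clears one asserter.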

Two additional areas of concern remain unaddressed by PMBus:

  • Interactions between devices
  • Fault logs

Both of these concerns were intentionally left undefined because the PMBus committee felt these functions were better left to vendor innovation.

Of course, there are different ways to handle these functions: using PMBus and a board controller, or using built-in device features. That choice is more or less the basis of the enquiries.

Let’s get started…

I built a prototype using my board controller reference design, which is based on a multi-threaded RTOS. This gave me practical results rather than best-case calculations, which may not be realizable in practice.

For hardware, I used a Freescale Kinetis K60 with pseudo-static RAM (PSRAM) and ferroelectric RAM (FRAM). PSRAM was a matter of convenience: I already had drivers. FRAM was chosen because data is committed by the last clock of its write transaction, it does not require writing in blocks, and the number of program cycles before wear-out is very large. For PMBus devices, I used an LTC3880, an LTC2974, and an LTC2977. I put a current load box on VOUT0 of the LTC3880 to cause a fault.

Telemetry runs in its own thread, fault handling runs in another thread, and there are some lower-priority utility threads.

The application works roughly as follows:

  1. ALERT/ is asserted due to an overcurrent
  2. ARA is performed to get an address
  3. STATUS_WORD is read
  4. Power off decisions are made and executed
  5. STATUS_WORD is stored in FRAM
  6. Output current is read from all 13 rails
  7. Output current is stored in FRAM
  8. A retry timer is set
  9. Retry is executed

This is an approximation: if multiple rails fault at the same time, more status is stored in FRAM. That is very common, because an overcurrent can cause an undervoltage, and rails may interact.

[Figure: scope capture of fault handling, telemetry, and SD card storage times]

This scope capture shows the results. You can see the telemetry on the I2C bus taking about 40ms for all 13 rails, and the results being stored on an SD card in 200ms; later telemetry takes 300ms (look at “Store to SD Card”). This is why an SD card is good for telemetry but not for a fault log. The reasons are complex, but keep in mind the SD card has a FAT file system, so the number of operations includes reading directory structures, etc.

You can see multiple assertions on the ALERT/ pin; the total fault processing time is about 50ms. This is the time to perform multiple ARAs, read status from multiple rails, gather some output current readings, and store to FRAM. The fault triggers a sequenced off, which takes more than 400ms. Eventually there is a retry.

There is some good news and some bad news. The good news is that a 13-rail system with multiple faults can store fault data in 50ms, and this is close to a worst-case number. Typical faults can gather and store data in less than 10ms. If you look closely at the fault from the retry, you can see several very fast FRAM transactions as the ARA is performed. In that scenario, the original fault was captured in a few milliseconds.

But now the bad news: the time it took to shut off the rails was hundreds of milliseconds. OK, I know, that is not typical of real systems. I just wanted you to see how slow power shut-off can become if you are not thinking about how your PMBus fault response is coded.

[Figure: scope capture with the fault response changed to immediate off]

After switching to immediate off, notice how much faster the rails power off. Let’s zoom in a bit:

[Figure: zoomed scope capture of the immediate-off power-down]

With immediate off, it took 2.5ms to power off the rails. The time is a combination of reading status registers, sharing the bus with telemetry, and commanding off the rails, so this number will move around: sometimes it will be fast, and sometimes it will be slow. The best possible case would be an ARA followed by a status read followed by a global off command. That is a read byte (3 bytes), a read status word (5 bytes), and a global off (6 bytes). At 400kHz that is 375µs, and this does not include any driver overhead.

Note: three rails ramp down very slowly because they only have a few milliamperes of load. You can kill power fast, but you need a load to yank it to ground. But that is another topic.

This is much better, but can we do even better? Of course: by using the built-in fault management of the devices. Let’s see what that can do.
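The best-case arithmetic can be modeled with a few lines of C. The per-transaction framing cost here is my approximation (9 clocks per SMBus byte including the ACK, plus a few clocks for START/STOP), so the model lands near, not exactly on, the 375µs figure above.

```c
/* Rough SMBus transfer time in microseconds: 9 clocks per byte
 * (8 data + ACK) plus an approximate 4 clocks of START/STOP framing
 * per transaction. Framing cost is an assumption for illustration. */
static double smbus_us(int bytes, int transactions, double clock_hz)
{
    int clocks = bytes * 9 + transactions * 4;
    return clocks * 1e6 / clock_hz;
}
```

For the 14-byte best case over three transactions at 400kHz, this returns about 345µs, in the same ballpark as the figure above.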

[Figure: scope capture of the built-in device fault response]

The rail is off in 30µs and down to ground in less than 100µs. This is a very lightly loaded rail: less than 1A. If I had used a 20A load, it would have been off much faster. This short delay does not have to compete with other activities: it is simply a comparator. Your code is free to do as it pleases, having no effect on this built-in fault response.

So what is the takeaway? Using the PMBus for telemetry and system-wide fault logging makes sense. You can gather system-wide data and put it in non-volatile memory quickly enough. This can add value on top of the built-in fault logging found in most devices. Typically the built-in log will have more detail and better information about the source of the original fault, while the external logger will have globally time-stamped information about the whole system. With both logs, you have the best chance of diagnosis.

However, using a serial bus to protect a load is not a good idea. The theoretical best case for a 400kHz serial bus is 10× slower than the built-in solution. Let’s look at the problem a different way: suppose a serial bus had to power off a rail in 30µs; how fast would its clock have to be? Using the ideal case of 14 bytes, that equals 112 bits. Adding a little time for interrupt latency and/or decision logic, that is about 4MHz. Now consider what happens if there are 10 devices on the bus faulting at the same time. That requires 40MHz. Now build a 100 rail system…

In both cases, load protection and fault logging, the PMBus is functionally capable of crafting a response. But in the case of logs, it is best as an enhancement to built-in logs, and in the case of load protection, it is best to leverage the device’s ability to share fault information. This is exactly what the original PMBus committee intended: to create a shared standard that solved common problems, yet supported innovation.
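The required-clock arithmetic above reduces to one division. Here it is as a sketch, ignoring latency and protocol overhead, which is why the text rounds the 3.7MHz result up to about 4MHz.

```c
/* Minimum serial clock (Hz) needed to move a given number of bits
 * within a protection window, ignoring latency and protocol overhead. */
static double required_clock_hz(int bits, double window_s)
{
    return bits / window_s;
}
```

For 112 bits in a 30µs window this gives roughly 3.7MHz, and ten devices faulting at once scale it to roughly 37MHz, hence the 40MHz figure once headroom is added.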