Advanced Technical Paper :
"Design Patterns for High Availability"
High availability systems are those able to tolerate both expected and unexpected faults. Their design is based on a
combination of redundant hardware components and software to manage fault detection and correction. This must
be done without human intervention to achieve “five-nines” (99.999%) or greater availability, equivalent to less than 1
second of downtime per day. After a quick presentation of definitions relevant to high availability and fault
management, basic hardware N-plexing and voting issues are discussed. This is followed by an in-depth discussion
of software fault tolerance techniques appropriate for embedded systems, starting with the static method of N-version
programming. A number of dynamic software fault tolerance techniques are then surveyed, including Checkpoint-
Rollback, Process Pairs and Recovery Blocks. The discussion ends with a forward error recovery technique called
Alternative Processing. Many real-world examples are presented.
When we want to design a high availability system, we need to focus a lot of our design effort on issues of failures and
For want of a better definition, we can define a “failure” as a situation in which the service provided by a system is not
according to its specification. I’d prefer a definition in which we’d compare the service provided by a system with our
expectations of the system. But often our expectations are subjective and not well documented. So we’ll stick with a
comparison of a system’s services to its specification, even though that leaves us open to problems arising from
specification errors or omissions.
A “fault”, on the other hand, is a failure of an interacting system. It’s what we think of as the possible “cause” of
something (undesirable) happening. So a fault might be a failure of a sub-system of our system, or a component
failure, or a failure of an external system, or a programming mistake. Faults can trigger more faults of various kinds.
Or they can trigger failures. Or they can happen without triggering any failure at all.
For example, a fault might be the failure of a trucker to respect the load limit of the Golden Gate Bridge. If the truck
exceeds the load limit by 20 or 30 percent, we don’t expect the bridge to fail. But another fault might be that same truck
crashing into the median divider of the bridge roadway. This wouldn’t cause the bridge to physically fail either, but it
might cause the bridge to fail in its specified purpose of bringing people and vehicles across the Golden Gate
waterway for a number of hours.
Faults may be transient, permanent or intermittent. When they are activated, they may cause errors in the state of a
system or sub-system. And it is these errors that may trigger failures of your system. There are 4 main approaches to
dealing with faults:
* fault forecasting
* fault avoidance
* fault removal
* fault masking.
Fault avoidance and removal are implemented by using rigorous system, hardware and software development
processes, perhaps including formal specification and verification techniques. Fault masking, or “fault tolerance” is
implemented by using a redundant, perhaps diverse implementation of a system to avoid the effects of faults. One
way to implement fault tolerance is called “Graceful Degradation” (or “Fail Soft”) --- providing sensible partial
functionality if full system performance cannot be provided. Another technique is called “Fail Safe” (or “Fail Stop”) ---
stopping the system in a safe state rather than continuing operation when a fault appears.
The main technique for fault tolerance is redundancy. It’s based on the idea (or the hope) that multiple independent
faults will not strike your system together. A fault tolerant system should be designed to avoid single points-of-failure.
In other words, if a part of the system can suffer a fault, there should be a redundant part of the system that can cover
for the fault and thus avoid a failure.
Redundancy comes in a variety of forms:
* hardware redundancy (low-level, high-level, or both)
* software redundancy
* time redundancy
* information redundancy.
Examples of hardware redundancy include self-checking logic circuits, and multiple flight computers in a single
airplane. Software redundancy might use 2 different algorithms to calculate the same result. Time redundancy may
be done by communication re-transmissions. And examples of information redundancy include backups, checksums
and error correction codes.
Redundancy may be either dynamic or static. Both use replicated system elements. In static redundancy, all replicas
are active at the same time: If one replica “throws a fault”, the other replicas can be used immediately to allow the
system to continue correct operation. In dynamic redundancy, one replica is “active” and others are “passive”: If the
“active” replica “throws a fault”, a previously “passive” replica is activated and takes over critical operations.
So what does all of this have to do with High Availability? First a definition: “High Availability” is the ability of a system
to tolerate faults and continue to provide service as specified. Systems can use all of the approaches and techniques
described above to achieve High Availability. Availability is often measured in units of “availability percentage” or
“down time per year”. Typical “fault tolerant” systems may be expected to be available 99.99% of the time, or be
“down” less than an hour per year. But “high availability” systems are often expected to be available 99.999% of the
time (“five-nines”), or be down less than 5 minutes per year or roughly 1 second per day. This generally means that
when a fault appears, it must be handled fully automatically – humans are too slow to react to remove or mask any
problem within the time required.
HARDWARE FAULT TOLERANCE
Rather than building hardware as individually super-reliable modules constructed of super-reliable components, it is
usually much more cost-effective to use redundant replicated modules of ordinary quality hardware built with ordinary
Each of the replicas is often designed to exhibit “Fail Fast” or “Fail Stop” behavior. This means that the hardware
module will stop almost immediately, when it detects that it has a fault. This vastly simplifies fault management
decision making: Every failure will make the hardware stop dead-in-its-tracks; rather than attempting to “limp along”
and challenge the fault manager to figure out which outputs of the module are now faulty and which are still good.
For fault tolerance using static redundancy, each replica module can be of quite ordinary reliability. If two replicas are
used, this is called “pairing” or “Duplexing”. If N replicas are used, this is called “N-Plexing”.
Figure_1 (below) shows a 3-Plexed or “Triple Redundant” hardware design.
|Figure 1: 3-Plexed or “Triple Redundant” Hardware
The 3 replicas are shown near the bottom of the diagram. They provide their outputs to a “voter” which determines the
actual final output of this triply redundant subsystem. When N >3 in an N-Plexed design, a “voter” will usually use
majority decision making. However, this needs to be a majority of the non-failed replicas, not just a simple majority of
the total number of replicas (failed and non-failed).
An issue that is sometimes of concern is the question of voter failures. Isn’t a voter just a piece of hardware and
software that could fail just like any of the other modules in our system? In fact, it could; and it would bring disaster to
our system if it did. But voters are normally very simple units that can be designed and tested with great care to
ensure their robustness. [Or alternatively, it is possible to get into designs involving replicated voters and second-
level voters of voters, etc.]
For fault tolerance using dynamic redundancy, the replicated modules can still be of quite ordinary reliability. The
modules do not have to be precise replicates of one another, and can have different characteristics, interfaces and
capacities. Dynamic redundancy requires a “Failover Strategy” to decide how to manage multiple modules when the
“primary” module throws a fault. Here are some choices:
* Standby Backup: While the primary module is running in the system, a backup module is waiting in “standby”,
watching the primary module for faults and ready to fire up and take over.
* Rotating Standby: While the primary module is running in the system, there may be a number of backup modules.
One backup will take over running the system in case of a fault in the primary.
* Fallover to Non-Critical Module: Primary module is running the critical resources of the system. A backup module
can be running other non-critical things, but it can take over the most critical services of the primary in case of fault.
* Mutual Takeover: Each module runs its own critical resources, but can take over the critical resources of another
module in case of fault.
The correct implementation of failover is crucial. It would be a disaster if a primary module that was throwing faults
were to continue running, while at the same time another module would try to take over its services. Their services
could collide in unexpected ways. And it might equally be a disaster if a primary module were stopped after throwing a
fault, and then no other module came up to take over its services. So the validation and testing of failover is critical,
although it’s not something that most of us enjoy doing.
SOFTWARE FAULT TOLERANCE
Most hardware faults are random faults resulting from physical defects, either manufacturing defects or defects that
develop over time as components wear out or suffer shocks from the surrounding physical world. Software faults, on
the other hand, are not physical. Software does not wear out. Instead, software faults result from the invocation of
software paths that contain defects in the software design or implementation. Since software is often much more
complex than hardware, it can be expected to have many more built-in defects – resulting in many more software
faults than hardware faults. Software fault tolerance is also much more expensive to design than hardware fault
A veteran technique for software fault tolerance is something called “N-Version Programming”. In the 1970s when I
first ran across it, we called it “Dissimilar Software”. It’s the software analogy of hardware N-Plexing (see Figure_1
above). But it’s not as simple as the replication of N-Plexing, since if N copies of the same software were running they’
d simply contain the same software faults and produce the same software errors N times. So if N units of some
software functionality need to run in parallel, they need to be N disparate implementations of that functionality –
independently implemented by N separate development teams. That’s N-Version Programming.
Back in the 1970s we thought of N-version programming as the state-of-the-art in software fault tolerance. But since
then, experience has exposed a number of problems with this technique: Software development costs skyrocket,
since you need to pay the N independent groups to implement the N independent software designs. But if you try to
skimp on some of the costs, you’ll run into what’s called the “Average IQ” problem: Less expensive development
teams contain less-qualified software engineers that produce lower-quality code. So you may end up with N diverse
programs that are all riddled with faults, created in N different ways.
Another issue in N-version programming is the question of what to provide as input to the N independent
development teams. In general, what is done is that a single specification is photocopied and provided to all the
development teams. But if this specification is flawed, you’ll get N independently developed versions of similarly
flawed software that all do the wrong thing. And if specification or usage errors crop up after system deployment,
each new error must be fixed N times – once in each of the N diverse implementations, thus making maintenance
costs exorbitant. Nowadays, the pricetag of N-version programming is usually thought to be better spent by asking
one top-notch software development team to develop one high-quality software version using the best available
infrastructure, software development tools, techniques and testing.
In contrast to the static redundancy of N-version programming, there are a number of software fault tolerance
techniques based on dynamic redundancy. All of them take a 4-stage approach:
1. Error Detection
2. Damage Assessment and Confinement (sometimes called “Firewalling”)
3. Error Recovery
4. Fault Treatment and Continued Service.
In the second of these stages, a “Fail Stop” attitude is often invoked when software errors are detected. But software
is highly complex, so it is often unclear how to “un-do” the effects of faulty software behavior surrounding an error.
One very helpful tool in this regard is the concept of a transaction. A transaction is a collection of operations on the
state of an application, such that the beginning of a transaction and the end of a transaction are points at which the
application is in a consistent state.
If we are to use the concept of transactions for fault tolerance, our system must be able to save its state at the
beginning of a transaction. This is called “Checkpointing”. It involves taking a “snapshot” of the software’s situation
just as it is about to begin the first step of the next transaction. The snapshot is only taken if the previous transaction
was completed in a fault-free state. The basic recovery strategy here is re-execution: When an error is detected
during a transaction, the transaction is “Fail Stopped” and then the system is reloaded back to the latest saved
checkpoint. Then service is continued from this checkpoint, allowing new transactions to build upon its consistent
state. However, the transaction that was “Fail Stopped” is lost. This kind of error recovery is referred to as “Backward
Error Recovery”, since the software state is rolled back to a past error-free point.
Simple checkpointing has its own dangerous single point-of-failure: There might be a fault during the taking of the
“snapshot” of the checkpoint itself. But there’s a solution to this, sometimes called “Checkpoint-Rollback”. It’s
illustrated in Figure_2.
|Figure 2: Checkpoint-Rollback Scenario
In this figure, the ellipses represent a software client and a software server that communicate with each other by
sending messages through queues. There may be many Request messages and many Response messages in a
single transaction. During a transaction, data are changing within the server. At the end of a transaction, a consistent
set of data should be recorded on each of the two persistent mass-storage devices shown on the right. Together with
the data, a transaction sequence number should be recorded. If an error is detected sometime later and the server
software is stopped, this server is restarted (or a replica server is started). As part of the recovery procedure, the
transaction sequence numbers on the two mass-storage devices are checked. Recovery of server data is done from
the device containing the highest sequence number. [The other mass-storage device may contain a lower sequence
number because there was a fault during the checkpointing to that device.]
A limitation of this checkpointing approach, is that recovery after a fault may take quite a long time. There may be a lot
of processing involved in starting or restarting a server, and in recovering the data back to the checkpoint. What could
speed this up is a “hot backup” server working directly with persistent mass storage of its own. This approach is
called “Process Pairs”, and is illustrated in Figure_3.
Here we see a “primary” server at the center of the diagram working very much like in the previous checkpointing
scenario. Clients work directly with this primary server. But whenever the primary server successfully completes an
entire transaction, it passes information about its new consistent state to a “backup” server (on the right). Both the
primary and backup servers record these data in their persistent mass-storage devices. In this way, the backup
server is kept current about completed transactions. While the primary server is available to clients, it sends regular
“heartbeat” messages to the backup server. These heartbeat messages may be combined with some of the
checkpoint messages. If the backup server detects that the stream of heartbeat messages has stopped, it
understands that the primary server is dead or unavailable, and it will very quickly take over as a new primary server.
With all its sophistication, the “Process Pairs” approach still leaves some issues open. For example, checkpointing
messages may get lost or arrive at the backup server out of order. Or … what happens if a primary server is the
controller of a physical device and fails in mid-operation? How will the backup server complete the operation when it
takes over? [Example: Primary server has moved the graphite rods of a nuclear reactor 10 cm. of the required 30 cm.
when it fails.]
Once again, in the process pairs approach the transaction that was underway when the primary server failed, may
well be lost or dropped in mid-execution. The backup server will take over in the state of the last previously completed
transaction reported to it by the primary server before it failed.
Another approach to dynamic software redundancy is called “Recovery Blocks”. It is also based on the concept of
checkpointing, but it adds the new idea of an Acceptance Test for the results of software processing. You need to
prepare a number of alternative ways of processing your data, as in N-version programming; but you don’t run all of
your processing alternatives for each transaction. Instead, you select one way of processing your data as your primary
alternative, and first try to do the transaction using the primary alternative’s processing only. At the end of this primary
processing, you run the Acceptance Test. If the Acceptance Test is passed, you’re done! If the Acceptance Test is
failed, you go on to try the next alternative processing, as shown in Figure_4.
|Figure 4: Recovery Blocks
As seen in this diagram, checkpointing must be done before first entering the recovery block. And the software must
“roll back” to the checkpoint state after every failed Acceptance Test. The Acceptance Test is performed after every
processing alternative that is tried. In this way, processing alternatives are run until a processing alternative
succeeds in delivering results that pass the Acceptance Test. These are the “good” outputs that comprise the final
results of the Recovery Block.
Checkpoint-Rollback, Process Pairs and Recovery Blocks are all ways of doing backward error recovery. But
“Forward Error Recovery” is also an option for dynamic software redundancy. The idea here is to continue from an
erroneous state and make corrections that will clean up the damaged state. Forward error recovery depends on
accurate damage assessment and is often system-specific. An example of forward error recovery is Exception
Handling, which passes control to specialized exception handling software when a problem is detected, rather than
rolling back and continuing from some previously checkpointed state.
One technique for forward error recovery is called “Alternative Processing”. This approach can be used when there
are two (or more) processing alternatives for a transaction, with one alternative normally being very precise but
computationally complex, and the other alternative being more simplistic and of higher performance. The second
alternative can be used as both a test and a secondary results provider, if the primary processing is faulty. For
example, a QuickSort algorithm can be the primary processing and a BubbleSort be its alternative. Once QuickSort
has run, BubbleSort can be used both to test the QuickSort results and to quickly fix some QuickSort errors. A second
example is shown in Figure_5.
|Figure 5: Alternative Processing for Aircraft Flight Control
In this diagram, we see two digital flight control systems being run at the same time to control a single airplane (777).
The decision logic on the right side of the diagram uses the outputs of the simpler flight control system as a ‘yardstick’
against which to test the outputs of the complex primary flight control system. If the acceptance test fails, then the
outputs of the simpler flight control system could also be used as an alternative to the failed primary outputs. [The
arrows going to the left show that the results of the alternative processing may also be used to provide feedback to the
It is only techniques such as these that will allow normal commercial-quality hardware and software to be used as
building blocks for truly high-availability systems – systems that can, without human intervention, achieve “five-nines”
(99.999%) or greater availability, equivalent to less than 1 second of downtime per day over years of operation.
This paper has been a short introduction to high availability embedded systems design issues.
|An earlier version of this material was published in the August 2002 issue of Embedded Systems Programming magazine.
© Copyright 2016, D. Kalinsky Associates, All Rights Reserved.
This page Updated March 25, 2016