Advanced Technical Paper:
"Architecture of Safety Critical Systems"
Safety critical systems are embedded systems that, in cases of errors or failures, could cause injury or loss of human
life. Systems such as flight control, automotive drive-by-wire, nuclear reactor management, or operating room heart-
lung-bypass machines naturally come to mind. But devices as common as the power windows in your car are also
safety-critical: imagine a small child at a fast food drive-thru reaching out of the car window to get
another packet of ketchup and accidentally leaning on the control switch, closing the window on the child's arm.
Small system defects or situations can cascade into life-threatening failures very quickly, as shown in Figure 1.
Figure 1: Fault-Error-Failure Cascade can lead to Life-Threatening Hazards
Faults are defects or situations that can lead to failures. They may be quite small, such as a frozen memory bit, an
uninitialized variable in software, or a cosmic ray ionizing its way through our embedded system. A fault may (or may
not) lead to an error. An error is a manifestation of a fault as an unexpected behavior within our system. It might be
something like an incorrect result of a calculation or a mistaken value of a state variable. Errors may (or may not) lead
to failure. A failure is a situation in which a system (or part of a system) is not performing its intended function. As we
see in Figure 1, a low-level failure in some small part of a system can be viewed as a fault at another level, which can
lead to errors at that level that can trigger failures, that can themselves be viewed as faults at yet a higher level. If
these faults are allowed to “avalanche” to system-level failures, they can lead to hazards that have the potential to
threaten injury or loss of life.
SAFETY vs. HIGH AVAILABILITY
Some readers may be thinking “Hey, this is starting to sound an awful lot like the ‘high availability’ stuff I read in
Embedded Systems Programming magazine 3 years ago.” But while there are a number of points-of-contact
between safety critical systems design and high availability systems design, the objectives of the two types of
systems are quite different and many of the design architectures they use are quite different.
Many high availability systems do not threaten human life in cases of failure, and instead are designed to maximize
“up-time” and minimize “down-time”. High availability systems today strive to be up-and-running 99.999% of the time
(“five nines availability”), equivalent to about a total of 5 minutes of “down-time” per year.
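The "five nines" figure is easy to check with a quick back-of-the-envelope calculation (the 365.25-day average year is an assumption of this sketch):

```python
# Downtime permitted per year at "five nines" (99.999%) availability.
availability = 0.99999
minutes_per_year = 365.25 * 24 * 60   # average year, including leap days
downtime_minutes = (1 - availability) * minutes_per_year
print(f"{downtime_minutes:.2f} minutes of down-time per year")  # about 5.26
```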
On the other hand, safety critical systems do not always strive to maximize “up-time”. In fact, they may intentionally
bring themselves “down” or bring some sub-systems “down” in situations where there is a threat of injury or loss of
life. They strive to bring the system to a “Safe State” in order to break the Fault-Error-Failure-Hazard sequence of
Figure 1, before a life-threatening situation is ever reached. For many safety-critical systems, such as medical
infusion pumps and cancer irradiation systems, the “Safe State” is to immediately stop and turn the system off.
For other safety critical systems, no safe state exists. For these systems, stopping is simply not an option. Examples
are aircraft jet engine controllers and medical respiratory ventilators. For yet other safety critical systems, safe states
do exist, but they require a complex and lengthy sequence of activities in order to bring the system to the safe state.
An example of this is an automotive brake-by-wire system. While driving on the Interstate or the Autostrada, you don’t
want your automotive system to suddenly announce “Memory Parity Error: Brakes Not Available.” A much safer
design alternative would be “graceful degradation” like “Bring to Garage Now: Only 3-wheel Braking Available”.
Another example is the car’s power window, which should open completely whenever it detects an obstacle during
upward motion. These systems must not stop or turn themselves off when a hazard is detected; instead their
embedded services must continue to be available while failures and hazards are present.
Figure 2 illustrates the relationship between safety critical and high availability systems, with regard to hazards and
availability.
Figure 2: Safety Critical vs. High Availability Systems
In this Venn diagram, safety critical medical infusion pumps and cancer irradiation systems would fall into the leftmost
section; while jet engine controllers, respiratory ventilators, brake-by-wire systems and power windows fall into the
center section where safety critical and high availability systems overlap. The rightmost section of the diagram is for
high-availability systems that are not safety-related, such as on-line banking, stock exchanges and business-critical
websites. Many communication systems fall into this rightmost section, apart from features such as emergency
response (“911” in U.S.A.) that fall into the center section of the Venn diagram.
The rightmost section of Figure 2 was addressed in a previous article, "Design Patterns for High Availability". Many of
the design patterns discussed there are also useful building blocks for safety critical systems that fall into the center
section of Figure 2. The remainder of this article focuses on the left and center sections of the diagram, covering
safety critical systems.
As for all embedded systems, design is preceded by system requirements definition, covering physical and functional
specification. For safety critical systems, a thorough Hazard Analysis and Risk Analysis must also be done. Only then
can architectural design get started.
The objective of Hazard Analysis is to systematically identify the dangers to human safety that a system may pose,
including an evaluation of the likelihood of an accident resulting from each hazard. A popular technique for doing
Hazard Analysis is called “Fault Tree Analysis”. It takes a top-down hierarchical decomposition approach. But it
doesn’t decompose functions, the way we learned in freshman engineering class. Rather it involves decomposing
undesired system events – in order to identify which combinations of hardware, software, human, or other errors
could cause safety-threatening hazards.
A Fault Tree Analysis begins by asking “What are the 3 (…or 6 or 7) most life-threatening things my system could
conceivably do?” Each safety threat you come up with, will become the top node of its own “fault tree” as shown in
Figure 3. Then ask “What sorts of things could cause this to happen?” Your answer will be shown as the first level
of decomposition of the fault tree. Then ask “What sorts of things could cause each of these to happen?” Your
answers will become the next level of the fault tree; and so on. You can use the logical “AND” and “OR” symbols from
digital electronics to provide details of logical combinations in your diagrams, as in Figure 3.
Figure 3: Fault Tree Analysis Example
An alternative, perhaps more systematic, approach to Hazard Analysis is called “Event Tree Analysis”. It is a bottom-
up approach that examines the results of operation or failure of system components and sub-systems. An Event Tree
is often diagrammed horizontally, as in Figure 4.
Figure 4: Event Tree Analysis Example
The top-level safety-threatening event for this event tree is shown on the left, and the various system components and
sub-systems involved in handling this event are shown across the top of the figure. The specific safety-threatening
situation analyzed in this event tree is “If medical infusion pump fluid pressure fails, will the system report an alarm
as required?” For each component and sub-system involved, probabilities of successful operation and failure are
noted. When the probabilities are mathematically combined, the result of this example is that alarms will fail to be
reported 16.21% of the time. [Please note: This example is made up to keep the numbers simple. Real medical
systems are typically much more reliable.]
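The way an event tree's per-stage probabilities combine can be sketched in a few lines of Python. The three stages and their success probabilities below are hypothetical (they are not taken from Figure 4); they were chosen only so that the combined result matches the made-up 16.21% figure above:

```python
# Event tree arithmetic: the alarm is reported only if every stage in the
# chain succeeds, so multiply the per-stage success probabilities.
stages = {
    "pressure-loss detection": 0.90,   # hypothetical values, chosen so the
    "alarm processing":        0.95,   # combined failure probability matches
    "alarm annunciation":      0.98,   # the article's 16.21% example
}

p_reported = 1.0
for name, p_success in stages.items():
    p_reported *= p_success

p_not_reported = 1.0 - p_reported
print(f"Alarm fails to be reported {p_not_reported:.2%} of the time")  # 16.21%
```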
After Hazard Analysis, the next step is Risk Analysis. Risk is the combination of the probability of an undesired event,
with the severity of its consequences. It might be expressed in units such as “deaths per 100 years of system
operation”. Once the greatest risks posed by a system have been identified, they must be dealt with in the system
design: If possible, the underlying hazards should be avoided or removed. This can often be done using:
• Hardware Overrides, to bypass risky software components
• Lockouts, to prevent entry into risky states
• Lockins, to ensure remaining within safe states
• Interlocks, to constrain sequences of events in order to avoid hazards.
If it is not possible to totally avoid or remove the hazards, the risk of accident must be minimized; and if an accident
does occur the risk of loss of life must be minimized.
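Of these mechanisms, an interlock is perhaps the easiest to sketch in software: it simply refuses any event that arrives out of its required sequence. Below is a minimal illustrative sketch; the event names are invented for this example, not taken from any real device:

```python
# A minimal interlock: events must occur in the prescribed order.
# Out-of-sequence requests are refused rather than obeyed.
REQUIRED_SEQUENCE = ["door_closed", "beam_armed", "beam_on"]  # invented names

class Interlock:
    def __init__(self, sequence):
        self.sequence = sequence
        self.step = 0  # index of the next event we will accept

    def request(self, event):
        """Accept the event only if it is the next one in the sequence."""
        if self.step < len(self.sequence) and event == self.sequence[self.step]:
            self.step += 1
            return True   # event permitted
        return False      # out of sequence: refuse, keep the system safe

lock = Interlock(REQUIRED_SEQUENCE)
print(lock.request("beam_on"))      # False - the door is not yet closed
print(lock.request("door_closed"))  # True
print(lock.request("beam_armed"))   # True
print(lock.request("beam_on"))      # True - full sequence respected
```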
Together with the system requirements, the results of the hazard analysis and risk analysis will guide a safety critical
system’s architectural design.
APPROACHES TO SENSOR ERROR DETECTION
Correct sensor data are so crucial to safe operation that many systems use redundancy in their sensor data
acquisition. Redundancy does not always mean sensor replication as shown in Figure 5 using 2 identical sensors. It
could also mean functional redundancy, or the measurement of the same real-world value in two different ways. For
example, patient respiration rate can be measured both by the expansion and contraction of the rib cage, and by
measurement of expiratory CO2 concentration.
Figure 5: Redundancy via Replicated Sensor Input Comparison
Redundancy can also be implemented as analytic redundancy, which is the comparison of a measured value with a
value derived in some other way, as shown in Figure 6.
Figure 6: Analytic Redundancy: Comparison to Other Data
For example, the result of a position sensor measurement could be compared to a calculation of the sum of the
previous position plus velocity multiplied by elapsed time:
x_t = x_0 + v_avg · t .
If there is known constant acceleration, the formula would instead be:
x_t = x_0 + v_0 · t + ½ · a_const · t² .
High school physics does come in handy! If the calculated and measured values agree pretty closely, then we are
confident that the sensor is working correctly. Another example: in the medical world, patient heartrate can be
extracted from a signal analysis of an arterial blood pressure waveform. It can then be compared to the value
measured directly from the patient’s electrocardiographic signal (“ECG” or “EKG”) when doing analytic redundancy.
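As a sketch, the constant-velocity cross-check above might look like this in code. The tolerance and the sample values are illustrative assumptions, not from any real system:

```python
def position_plausible(x_measured, x_prev, v_avg, dt, tol=0.5):
    """Analytic redundancy: compare the measured position against the
    value predicted from the previous position plus velocity * time."""
    x_predicted = x_prev + v_avg * dt
    return abs(x_measured - x_predicted) <= tol

# Sensor agrees with the physics: accept the reading.
print(position_plausible(x_measured=10.2, x_prev=8.0, v_avg=1.0, dt=2.0))  # True
# Sensor is far from the prediction: flag it for fault handling.
print(position_plausible(x_measured=14.0, x_prev=8.0, v_avg=1.0, dt=2.0))  # False
```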
These approaches can be combined and embellished. In the approaches discussed so far, when there’s a
disagreement in a comparison we know there’s something wrong with a sensor. But we don’t know which one is
wrong. So it’s often best to just shut down the entire redundant pair, in what’s called a “Fail-Stop”. An alternative
approach is to add a third redundant element, and to replace the “comparison” with “voting”. If this is done in a
strictly replication-based design such as Figure 5, the result is called Triple Modular Redundancy (“TMR”). But this
could also be done in a mix-and-match sort of way, resulting in a combination of several kinds of redundancy. In the
various triple redundancy approaches, a faulty sensor can be identified and shut down while the remaining redundant
elements can continue to operate safely.
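A triple-redundant voter for analog sensor readings can be sketched as a pairwise-agreement check: a reading that disagrees with both of its peers is declared faulty, while the agreeing pair carries on. The agreement tolerance here is an illustrative assumption:

```python
def tmr_vote(readings, tol=0.1):
    """Triple Modular Redundancy vote over three sensor readings.
    Returns (voted_value, faulty_index); faulty_index is None when all
    three sensors agree pairwise within the tolerance."""
    a, b, c = readings
    for faulty, (x, y) in enumerate([(b, c), (a, c), (a, b)]):
        # If the other two agree with each other but not with this one,
        # this sensor is the odd one out and can be shut down.
        others_agree = abs(x - y) <= tol
        this_disagrees = (abs(readings[faulty] - x) > tol and
                          abs(readings[faulty] - y) > tol)
        if others_agree and this_disagrees:
            return (x + y) / 2, faulty
    return sum(readings) / 3, None

print(tmr_vote([5.00, 5.02, 5.01]))   # all agree: average, no fault flagged
print(tmr_vote([5.00, 9.99, 5.01]))   # sensor 1 is the odd one out
```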
If a safety-critical system has got an immediate safe state, as illustrated on the left side of Figure 2, a “Shutdown
System” can be used to terminate a hazardous situation as soon as it detects one. This is illustrated in Figure 7.
Figure 7: Basic Shutdown Architecture
The shutdown system is a dedicated unit with responsibility for identifying dangerous situations. It will force the entire
system into a safe state (“OFF”) whenever a hazard is detected, and thus lock the system out of a life-threatening
state. It is independent of the primary system that is normally “in control”, and operates in parallel with it. To ensure
its complete independence, it has its own separate sensor(s). A diagnostic sub-system is used to ensure the
integrity of operation of the shutdown system itself. If the diagnostic sub-system determines that the shutdown
system’s decisions may be untrustworthy, it can bring the entire system to an immediate safe state (“OFF”), rather
than allowing the primary system to continue to operate without trustworthy shutdown monitoring going on in parallel.
A cancer irradiation facility can be designed in this way. The primary system operates a nuclear particle accelerator
that directs a highly focused beam into a well-defined area of a patient’s body. Its sensors monitor the radiation
dosage on target. Its irradiation shutdown system, on the other hand, works with radiation sensors on other parts of
the patient’s body and in other parts of the treatment room. It will also monitor radiation dosage on target. The
irradiation shutdown system itself is evaluated by an irradiation shutdown diagnostic sub-system.
As we will see later on, the primary system in Figure 7 can be designed in a number of ways, some of them quite
complex with sophisticated redundancy built in. But as shown thus far, the shutdown system portion of the design
has no redundancy and is thus a potential single point of failure. A single faulty shutdown system could also bring a
cancer irradiation facility to a standstill, thus endangering its patients in a different way by denying them their medical
treatment. So in fact, some safety critical systems have dual shutdown systems working in parallel (with either “AND”
or “OR” logic for deciding when to shut down the primary). In extreme instances, a safety critical system can be
designed with three shutdown systems working in parallel, using a triple modular redundant (“TMR”) style of voting
among them. In this way, a faulty shutdown system can be identified and itself be shut down, while the remaining
shutdown systems can continue to operate in redundant and trustworthy fashion and the primary system continues to
provide its services.
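The “AND” vs. “OR” choice for dual shutdown systems is a one-line decision with very different failure characteristics, sketched below: “OR” logic trips on either request (biased toward safety, at the price of more spurious shutdowns), while “AND” logic requires both systems to agree (fewer spurious trips, but a single stuck-at-“run” shutdown system can block a needed shutdown):

```python
def shutdown_or(req_a, req_b):
    """'OR' logic: shut down if EITHER shutdown system requests it."""
    return req_a or req_b

def shutdown_and(req_a, req_b):
    """'AND' logic: shut down only if BOTH shutdown systems request it."""
    return req_a and req_b

# Only shutdown system A detects a hazard:
print(shutdown_or(True, False))   # True  - 'OR' errs on the side of safety
print(shutdown_and(True, False))  # False - 'AND' waits for agreement
```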
SINGLE CHANNEL WITH ACTUATION MONITORING
The idea of a “shutdown system” can also be applied on a smaller scale within a primary system itself, as shown in
Figure 8. The ellipses represent major system activities, which could be implemented as software tasks or
processes, either on separate processors or sharing a single processor, depending on the scale of the system. A
basic primary system is structured by the simple design pattern of Input-Process-Output, shown here across the top
of the figure as the sequence labeled “Data Acquisition” --> “Processing/Transformations” --> “Output/Control”. To
lower costs, the primary system and the sensor data integrity checking “shutdown” monitoring activity (at the lower left)
are shown here as sharing the same input sensor(s).
Figure 8: Protected Single Channel, showing Actuation Monitoring Options
The idea of “shutdown” monitoring can also be extended to the output side of a system. This is called “Actuation
Monitoring”, and it is illustrated on the right side of Figure 8 for a medical safety critical system. Actuation Monitoring
can be done in a number of ways, each with a different balance of costs vs. benefits. The most basic form of actuation
monitoring is “End Around” monitoring. It simply checks the commands to the output actuators for validity before they
reach the actuators themselves. A more stringent form is “Wrap Around” monitoring, that checks that the output
actuators are actually producing valid outputs that will soon reach the patient under treatment. A third, usually more
costly form is “Actuation Results” monitoring, that uses an independent set of sensors to verify that the system is
actually producing the results it is intended to provide.
A medical infusion pump controller could be designed in this way. Let’s assume that a stepper motor is doing the
actual pumping of fluid. “End around” monitoring could be used to check that the stepper motor is receiving the
correct (or at least “reasonable”) commands. “Wrap around” monitoring could use a fluid flow sensor to check that the
correct (or at least a “reasonable”) amount of fluid is being delivered to the patient under treatment. And “Actuation
Results” monitoring could use an invasive probe to measure the concentration of specific drugs or other contents in
the patient’s bloodstream resulting from the operation of the infusion pump.
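For this infusion-pump example, the three monitoring styles differ only in what they compare. A minimal sketch follows; all function names, rates, limits, and tolerances are invented for illustration:

```python
# Three levels of actuation monitoring for a hypothetical infusion pump.
# All limits and tolerances below are invented for illustration.

def end_around_ok(commanded_steps_per_sec, max_steps_per_sec=200):
    """'End around': is the command sent to the stepper motor reasonable?"""
    return 0 <= commanded_steps_per_sec <= max_steps_per_sec

def wrap_around_ok(commanded_ml_per_hr, measured_ml_per_hr, tol_ml_per_hr=2.0):
    """'Wrap around': does a flow sensor confirm the commanded delivery?"""
    return abs(commanded_ml_per_hr - measured_ml_per_hr) <= tol_ml_per_hr

def results_ok(target_concentration, probe_concentration, tol=0.05):
    """'Actuation results': does an independent probe see the intended
    drug concentration in the patient's bloodstream?"""
    return abs(target_concentration - probe_concentration) <= tol

print(end_around_ok(150))          # True: command within motor limits
print(wrap_around_ok(50.0, 49.2))  # True: delivered flow matches command
print(results_ok(0.30, 0.45))      # False: results monitoring flags a fault
```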
A significant weakness of both the “shutdown system” and the “single channel” architectures is that they cannot
continue to operate safely in the presence of faults. They have single points of failure. You can see them stretching
across the top of Figure 8. This means that these architectures can only be used in safety critical systems that have
an immediate safe state, as on the left side of Figure 2.
DUAL CHANNEL ARCHITECTURES
For safety critical systems without an immediate safe state, dual channel architectures can be used to allow a system
to continue operation even when one of its channels has “Fail-Stopped”. In Figure 9 we see an illustration of a dual
channel architecture, in which each of the channels uses the single channel architecture of Figure 8.
Figure 9: A Dual Channel Architecture
One channel is active while the other serves as a standby or backup channel, ready to take over system operation if
the current primary channel suffers faults or failure. Depending on the needs of the specific safety critical system,
the standby channel when becoming active
could either continue normal operation of the system, or it could take the system through a possibly long and complex
sequence of steps to bring it to its eventual safe state.
For example, an operating room heart-lung-bypass machine has got to continue to deliver its life-sustaining services
even if one of its internal embedded processing channels fails. On the other hand, a nuclear reactor control system, in
cases of failure of one of its internal embedded processing channels, would be expected to stay in operation long
enough to shut down the reactor by proceeding through a lengthy sequence of activities: stepping the graphite
moderator rods down into the full depth of the reactor core, while accelerating the flow of coolant through the reactor,
and monitoring the gradual slow-down of the nuclear reaction through a myriad of sensors – until the reactor can be
declared safe for human access.
A dual channel architecture is going to have higher unit costs than previous architectures we’ve discussed. There will
be redundant embedded processing channels using redundant hardware and redundant sensors. But the big benefit
of paying this price, is the ability to continue to operate in the presence of a fault.
Dual channel architecture has a number of popular variants. If the 2 channels shown in Figure 9 use the same
replicated software and hardware, the architecture can handle random faults well; but it cannot handle systematic
faults such as software design or coding defects that would be reproduced in both channels. If this is of concern in
your system, a “heterogeneous dual channel” architecture is preferable. This kind of architecture would consist of 2
channels implemented in 2 totally different ways. For example, software for the 2 channels could be implemented by
two separate software development teams working from the same software requirements specification, in what is
called “N-Version Programming” or “Dissimilar Software”. Clearly, the development costs as well as the unit cost for
doing this would be high.
Another variant of the dual channel architecture is the “multi-channel voting” architecture. This extends the “Triple
Modular Redundancy” (“TMR”) approach discussed for sensor error detection, into the realm of entire replicated
processing channels. In this architecture, 3 (or more) channels operate in parallel. A “voter” compares the outputs of
the channels: If a majority of channel outputs agree, this will become the system’s output. If some channels disagree,
they will be “Fail-Stopped”.
An example of a multi-channel architecture used in aerospace applications, is the “dual-dual” architecture. Four
independent processing channels are organized into two pairs of 2 channels each. While one pair is active, the
members of that pair are continually comparing results. As long as they agree, they will continue to be active. But as
soon as they disagree, they will hand over control to the other pair, which will then become the active pair.
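The dual-dual hand-over rule can be sketched as a comparison within the active pair. This is a simplified sketch; real avionics implementations involve channel synchronization and fault latching far beyond what is shown here:

```python
def dual_dual_step(active_pair, outputs, tol=1e-6):
    """One cycle of 'dual-dual' logic. outputs holds four channel outputs
    grouped as two pairs: [(a1, a2), (b1, b2)].
    Returns (system_output, active_pair) after this cycle."""
    p1, p2 = outputs[active_pair]
    if abs(p1 - p2) <= tol:
        return p1, active_pair   # active pair agrees: it stays in control
    other = 1 - active_pair      # internal disagreement: hand over control
    q1, q2 = outputs[other]
    return q1, other

# Pair 0 agrees internally: it stays in control.
print(dual_dual_step(0, [(4.0, 4.0), (4.0, 4.0)]))  # (4.0, 0)
# Pair 0 disagrees internally: control passes to pair 1.
print(dual_dual_step(0, [(4.0, 9.0), (4.1, 4.1)]))  # (4.1, 1)
```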
Many safety critical systems do not have an immediate safe state, but cannot incur the high costs of a full dual-
channel or multiple-channel architecture. A lower cost compromise solution is the “monitor-actuator” architecture
shown in Figure 10.
This architecture does not have replicated identical channels, but instead has heterogeneous channels that differ
from one another. It has a single primary “Actuation Channel” that normally controls the system, shown in the upper
portion of Figure 10. The operation and results of this channel are examined by a separate simpler “Monitoring
Channel” shown below it. If the Monitoring Channel detects a fault in the Actuation Channel, normal operation of the
Actuation Channel cannot continue. Instead, control of the system is passed to a separate “Safety Channel” shown at
the bottom of the figure, which has responsibility for bringing the system to a Safe State. Depending on the needs of
the specific safety critical system, the Safety Channel could take the system through a possibly long and complex
sequence of steps to bring it to its eventual safe state.
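The monitor-actuator control flow can be sketched as below. The channel behaviors and the plausibility check are invented placeholders for illustration, not a prescription for any real system:

```python
# Monitor-actuator control flow, sketched with stand-in functions.

def monitor_ok(setpoint, actuation_output, tol=0.5):
    """Simpler monitoring channel: sanity-check the actuation channel's
    output against the commanded setpoint."""
    return abs(actuation_output - setpoint) <= tol

def safety_channel():
    """Bring the system toward its safe state (possibly via a long and
    complex sequence of steps)."""
    return "entering safe-state sequence"

def control_cycle(setpoint, actuation_output):
    """One cycle: the actuation channel normally controls the system;
    on a detected fault, control passes to the safety channel."""
    if monitor_ok(setpoint, actuation_output):
        return actuation_output   # normal operation continues
    return safety_channel()       # fault detected: safety channel takes over

print(control_cycle(10.0, 10.1))  # within tolerance: normal operation
print(control_cycle(10.0, 47.0))  # implausible output: safety channel acts
```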
Figure 10: A Monitor-Actuator Architecture
The Monitor-Actuator architecture could be a reasonable low-cost compromise for applications such as chemical
process control or car power windows. It can also serve in applications appropriate for “graceful degradation” of
function such as automotive brake-by-wire. The Safety Channel would implement the “graceful degradation” of
function in such applications.

The selection of a safety critical system architecture is driven by a rigorous hazard analysis followed by risk analysis,
in addition to conventional system requirements definition. System design may include combinations of redundant
sensor configurations, shutdown systems, actuation monitoring, multiple channel architectures, and/or monitor-
actuator structuring. These embedded systems architectures are much more valuable than can be measured in
dollars and cents. Their true value is in protecting and saving human lives.
FOR FURTHER READING
N. Storey, “Safety-Critical Computer Systems”, Addison-Wesley, Harlow UK, 1996.
B. P. Douglass, “Real-Time Design Patterns”, Addison-Wesley, Boston MA, 2003.
W. R. Dunn, “Practical Design of Safety-Critical Computer Systems”, Reliability Press, Solvang CA, 2002.
University of York (UK) High Integrity Systems Engineering group, http://www.cs.york.ac.uk/hise/ .
This material was published in the September 2005 issue of Embedded Systems Programming magazine (U.S.A.).
© Copyright 2016, D. Kalinsky Associates, All Rights Reserved.
This page Updated March 25, 2016