D. Kalinsky Associates
Advanced Course:

"Design of High Availability Systems & Software"

*  An Advanced Course for Experienced Real-Time Embedded System Designers and Software Developers

*  How to Structure Embedded Systems and Application Software for 99.999% Availability

*  2-Day Intensive Class (lectures, discussions, design examples, exercises)
COURSE OVERVIEW

This course examines the high-level design of embedded systems and software that are to provide their services at
near-continuous availability.

High availability systems must tolerate both expected and unexpected faults. Their design is based on redundant
hardware and software combined in ways that will achieve “five-nines” (99.999%) or greater availability, equivalent to
less than 1 second of downtime per day. Basic hardware N-plexing and voting issues are discussed, followed by an
in-depth study of a number of fault tolerance techniques, including static N-version programming and the backward
error recovery techniques Checkpoint-Rollback, Process Pairs, and Recovery Blocks. The class continues with several
forward error recovery techniques. Technical issues such as failover management, data replication, and software design
defects are addressed in depth. Many real-world examples are presented.

This is not a general course on system or software design theory. It is highly focused on the design of embedded
systems and software that must make their services available at all times, with no more than about 5 minutes of
downtime per year.
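
The downtime figures quoted in this overview follow directly from the availability percentage. As a minimal sketch of
that arithmetic (the function name `allowed_downtime_seconds` and the 365-day year are our own assumptions, not part
of the course material):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: a 365-day year


def allowed_downtime_seconds(availability: float, period_seconds: float) -> float:
    """Return the downtime budget (in seconds) for an availability fraction over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    five_nines = 0.99999
    per_day = allowed_downtime_seconds(five_nines, SECONDS_PER_DAY)
    per_year = allowed_downtime_seconds(five_nines, SECONDS_PER_YEAR)
    print(f"per day:  {per_day:.3f} s")
    print(f"per year: {per_year / 60:.2f} min")
```

Under these assumptions, five-nines availability allows about 0.86 seconds of downtime per day, or about 5.3 minutes
per year.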


WHO SHOULD ATTEND?

This course is intended for practicing real-time and embedded systems software architects, system architects, project
managers, and technical consultants who are responsible for designing, structuring, and implementing the software for
real-time and embedded computer systems that must continue providing service despite the occurrence of internal and
external faults.

Course participants are expected to be familiar with general embedded and real-time software design. [This knowledge
can be gained by attending a prerequisite embedded software design course such as "Architectural Design of Real-Time
Software".]


COURSE CO-REQUISITE

Many (but not all) high-availability systems are also safety-critical systems, which can threaten human safety or even
human life if the system fails and remains unavailable for a significant period of time. For those high-availability
systems that also have safety-critical requirements, we recommend taking the course "Design of Safety-Critical Systems
and Software" at the same time as this course. The two courses have little overlap in content and offer complementary
approaches and perspectives. The two courses can also be combined into a unified three- or four-day course for
presentation at customer sites, under the name "Safety Critical and High Availability Systems" Masterclass.


COURSE OBJECTIVES

The primary goal of this course is to give participants the skills necessary to design software for real-time and
embedded computer systems that must relentlessly provide service despite the occurrence of internal and external
faults.  This is a very practical, results-oriented course that will provide knowledge and skills that can be applied
immediately.


COURSE CONTENTS

Definitions and Background

High Availability
Fault -> Error -> Failure
Single Points of Failure
Fault Tree Analysis
Exercise: Probabilistic Fault Tree Analysis

Underlying Principles

Fault Avoidance vs. Tolerance
Failure Curves
Redundancy
Replication vs. Functional Redundancy vs. Analytic Redundancy
Dynamic vs. Static Redundancy
Extended Example: Space Shuttle Software

Fundamental System-Level Design Patterns

Static Hardware Fault Tolerance
N-Plex Design
Exercise: MTBF, MTTF Calculations in Triple Modular Redundancy
Dynamic System Fault Tolerance
Redundant Pairs
Clusters
Cluster Failover Strategy Choices
Examples: Redundant Cluster Design
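
As a hedged illustration of the kind of calculation the Triple Modular Redundancy exercise involves, the sketch below
uses the standard textbook model: three identical, independent modules with a constant failure rate, a perfect
majority voter, and no repair. These modeling assumptions and all function names are ours; the course's own exercise
may differ.

```python
import math


def module_reliability(failure_rate: float, t: float) -> float:
    """Reliability of one module at time t, assuming a constant failure rate."""
    return math.exp(-failure_rate * t)


def tmr_reliability(r: float) -> float:
    """2-of-3 majority: all three modules work, or exactly two of the three work."""
    return r**3 + 3 * r**2 * (1 - r)


def simplex_mttf(failure_rate: float) -> float:
    """MTTF of a single module: the integral of exp(-rate * t) from 0 to infinity."""
    return 1.0 / failure_rate


def tmr_mttf(failure_rate: float) -> float:
    """MTTF of TMR without repair: the integral of 3R^2 - 2R^3, which is 5 / (6 * rate)."""
    return 5.0 / (6.0 * failure_rate)


if __name__ == "__main__":
    rate = 1e-4  # hypothetical failure rate, in failures per hour
    for hours in (100, 1000, 10000):
        r = module_reliability(rate, hours)
        print(f"t={hours:>6} h  single={r:.4f}  TMR={tmr_reliability(r):.4f}")
    print(f"MTTF single={simplex_mttf(rate):.0f} h  TMR={tmr_mttf(rate):.0f} h")
```

In this model, TMR is more reliable than a single module only while the module reliability stays above 0.5, and its
MTTF without repair is lower than that of a single module. This is why such designs are usually judged by mission-time
reliability, or combined with repair, rather than by MTTF alone.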

Concepts for Backward Error Recovery

Design Diversity
Dynamic System Redundancy
Backward Error Recovery
Transactions
Checkpointing

System and Software Design Patterns for High Availability

Checkpoint-Rollback
Process Pairs
Recovery Blocks
Limitations of Backward Error Recovery Patterns
Forward Error Recovery Design Patterns
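
To give a flavor of one pattern listed above, here is a minimal sketch of a Recovery Block: state is checkpointed, a
primary routine runs, an acceptance test judges the result, and on failure the state is rolled back and an alternate
routine is tried. All names (`recovery_block`, `buggy_primary`, and so on) are hypothetical and chosen for
illustration only; this is not the course's own code.

```python
import copy
from typing import Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails or is rejected by the acceptance test."""


def recovery_block(state: dict,
                   alternates: Sequence[Callable[[dict], None]],
                   acceptance_test: Callable[[dict, dict], bool]) -> dict:
    checkpoint = copy.deepcopy(state)            # establish the recovery point
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)    # roll back to the checkpoint
        try:
            alternate(candidate)                 # run the primary, then each alternate
        except Exception:
            continue                             # a crash counts as a failed attempt
        if acceptance_test(checkpoint, candidate):
            return candidate                     # commit the accepted result
    raise RecoveryBlockFailure("no alternate passed the acceptance test")


def buggy_primary(state: dict) -> None:
    # Deliberately faulty: sorts but silently drops the last element.
    state["values"] = sorted(state["values"])[:-1]


def simple_alternate(state: dict) -> None:
    state["values"] = sorted(state["values"])


def acceptable(before: dict, after: dict) -> bool:
    vals = after["values"]
    same_length = len(vals) == len(before["values"])
    ordered = all(a <= b for a, b in zip(vals, vals[1:]))
    return same_length and ordered


if __name__ == "__main__":
    result = recovery_block({"values": [3, 1, 2]},
                            [buggy_primary, simple_alternate],
                            acceptable)
    print(result["values"])
```

The acceptance test here is deliberately cheap and incomplete (it checks length and ordering, not full correctness),
which reflects a common design trade-off in this pattern.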

Technical Issues in High Availability Design

Failover Management
Dealing with Software Design Faults
Extended Example: Airbus A330/340 Fly-by-Wire
Extended Example: Boeing 777 Fly-by-Wire

C Language in Critical Systems

Software Robustness: MISRA-C, LINT, Static Code Analyzers
Exercise: C-Language Shenanigans
Update on Static Code Analysis
The JPL "Power of 10" Coding Rules

Final Examination.


INSTRUCTOR:  Dr. David Kalinsky
Check it out! Get the "flavor" of this course by perusing one of the course's reading assignments:
"Design Patterns for High Availability"
© Copyright 2011, D. Kalinsky Associates, All Rights Reserved.
This page updated January 18, 2011