What is Fault Tolerance: Protecting Your System’s Health

Tyler Au
6 min
April 20th, 2023

What is Fault Tolerance?

Did you know that roughly 5 billion people around the globe use the internet daily?

In an age where we’re more connected than ever, ensuring that apps and systems are up and running is of the utmost importance. High availability and healthy systems are among the top priorities for many teams, and many organizations prioritize backups for their most important systems.

To combat random outages, organizations are practicing fault tolerance in many of their systems. Fault tolerance refers to a system’s ability to continue operating when faced with an interruption or a component failure. Think of fault tolerance as the best 4-wheeled car ever: if one of the tires were to pop, this fault tolerant car would be able to continue its normal operations with just 3 wheels (or a 4th wheel replacement).

The purpose of implementing a fault tolerant system within a company is to prevent disruptions arising from a single point of failure (SPOF), in order to safeguard high availability and business continuity (being able to deal with difficult situations so an organization can operate without disruption). A SPOF is a single component whose malfunction can bring the entire system to a halt. Fault tolerant design removes the risks posed by a SPOF through load balancing and failovers. These two solutions allow systems to allocate traffic based on available resources and to switch seamlessly to backup systems, respectively.

Although fault tolerance and high availability are deeply rooted in each other, the two aren’t the same. Fault tolerance is the system design that makes high availability attainable; if a system fails and lacks fault tolerant design, high availability is nearly impossible to achieve. A business continuity plan entails keeping both of these aspects in check, so that operations run as smoothly and efficiently as possible in all cases.

The Importance of Load Balancing and Failovers

To start, let’s define both of these aspects of fault tolerance.

Load balancing is a method of distributing network traffic across a pool of servers and resources. This workload distribution lets systems optimize how they handle traffic, ensuring that high volume is absorbed and that stress isn’t placed on a single server.
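To make the idea concrete, here is a minimal sketch of one common strategy, round-robin distribution that skips servers marked unhealthy. The class name, method names, and server names ("app-1" and so on) are hypothetical and chosen only for illustration; production load balancers typically add health checks, weighting, and connection tracking.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        # Stop routing traffic to a server that has failed.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Resume routing traffic to a recovered server.
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # One full pass over the pool is enough to find a healthy server.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [balancer.next_server() for _ in range(3)]

# Simulate a failure: traffic keeps flowing to the remaining servers.
balancer.mark_down("app-2")
after_failure = [balancer.next_server() for _ in range(4)]
```

In this sketch, requests are spread evenly while all three servers are up, and after "app-2" is marked down the rotation continues across the two that remain.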

Failover refers to the system’s ability to switch over to a backup system seamlessly and automatically in response to failure. Once components fail, operation switches to a backup system that mirrors the primary system and has been running in tandem with it.
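As a rough illustration of failover, the sketch below wraps a primary and a backup service and switches to the backup automatically when the primary raises a connection error. The names FailoverClient, primary_service, and backup_service are hypothetical, and real failover usually also involves health monitoring and state synchronization that this sketch leaves out.

```python
class FailoverClient:
    """Sends requests to the active service and fails over to the backup."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def call(self, request):
        try:
            return self.active(request)
        except ConnectionError:
            if self.active is self.backup:
                # The backup has failed too; nothing left to switch to.
                raise
            # Switch over and retry the same request on the backup.
            self.active = self.backup
            return self.active(request)


def primary_service(request):
    # Stand-in for a primary system that is currently unreachable.
    raise ConnectionError("primary is down")


def backup_service(request):
    # Stand-in for a backup that mirrors the primary's behavior.
    return f"backup handled {request}"


client = FailoverClient(primary_service, backup_service)
result = client.call("order-42")
```

The caller only ever calls client.call; the switch to the backup happens inside the client, which is what makes the failover seamless from the caller's point of view.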

Load balancing and failovers remove the risk of having a SPOF within a system, which is of the utmost importance for true fault tolerance. 

Fault Tolerant Operations

At its core, fault tolerance has two models that it abides by:

Normal Functioning 

Normal functioning refers to a situation in which a fault tolerant system has a faulty component but continues to operate normally. Thanks in part to redundant components, the system experiences no change in performance or metrics despite the fault.

Graceful Degradation

When failure occurs, the impact a fault can have on a system depends on the severity of the failure. Certain fault tolerant systems will experience graceful degradation, where the impact on performance is proportionate to the severity of the fault. Like the saying “the bigger they are, the harder they fall”, the bigger and more severe a fault is, the harder a system will fall.
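The two models can be compared with a little arithmetic. The sketch below assumes a hypothetical pool of identical workers, each able to serve a fixed number of requests per second; the function names and the numbers are illustrative only.

```python
def effective_capacity(total_workers, failed_workers, requests_per_worker):
    """Capacity left after some workers have failed."""
    healthy = max(total_workers - failed_workers, 0)
    return healthy * requests_per_worker


def served_fraction(demand, capacity):
    """Share of incoming demand the system can still serve (0.0 to 1.0)."""
    if demand <= 0:
        return 1.0
    return min(capacity / demand, 1.0)


# Four workers at 100 requests/second each (assumed figures).
# With spare capacity (demand 300), one failure goes unnoticed:
normal = served_fraction(300, effective_capacity(4, 1, 100))

# At full load (demand 400), service shrinks in step with the failures:
degraded = [
    served_fraction(400, effective_capacity(4, failed, 100))
    for failed in range(5)
]
```

With spare capacity the system stays in normal functioning after one failure. At full load, each additional failed worker removes a quarter of the service, which is the proportional decline that graceful degradation describes.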

Where You Might Need to Inspect

Most, if not all, fault tolerant systems contain backup components that seamlessly replace faulty components, so that a critical service can continue operating when part of the system fails. These components are:

Hardware

Hardware systems can be backed up by systems that are identical or equivalent to them. These backup systems run in tandem with the primary systems and mirror the operations. Think of the car analogy made earlier: a spare tire would be a good example of a replacement hardware system.

Software 

Software systems become fault tolerant once they’re backed up by other software. For example, having sensitive customer information backed up into a separate, isolated database and on entirely different computer systems ensures that related services continue operating in the case of a system failure. As mentioned above, failovers and having redundant components in case of system failures are of the utmost importance, with this philosophy applying heavily to software components in a fault tolerant system.

Power Sources

Power sources are able to become fault tolerant once a replacement power source is identified and ready. Having an alternative power source at the ready once a system experiences a loss of service is absolutely critical to uninterrupted operation. 

Components of a Fault Tolerant System

Achieving fault tolerance seems simple enough, right? If you’ve read to this point, it seems that having fault tolerance just means having backups for everything, and if you think that then you’re partially correct. However, fault tolerant systems and the means to improve them are made up of a few different components:

Diversity

Although having identical backups is extremely useful in the case of a service interruption, diversifying alternative sources can pay off in the long run. Like a Swiss Army knife, fault tolerant systems should be equipped with all the backups and alternatives necessary, even if having them might be redundant for a bit.

Redundancy

Redundancy within a fault tolerant system is a means of removing a SPOF. Once a component fails, the system will be able to identify the fault and automatically replace the component without impeding operations. For example, in the case of hardware, a redundant array of independent disks (RAID) is often put in place to guard against disk failure, letting users protect and access mirrored data without interruption.
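A quick calculation shows why redundancy helps. The sketch below uses the standard formulas for components in series (every one must work) and in parallel (any one is enough), under the simplifying assumption that components fail independently. The 99% availability figure is an assumed example, not a measurement.

```python
import math


def series_availability(availabilities):
    """All components must be up: any one of them is a SPOF."""
    return math.prod(availabilities)


def parallel_availability(component_availability, copies):
    """The service is up if at least one redundant copy is up.

    Assumes the copies fail independently of one another.
    """
    return 1 - (1 - component_availability) ** copies


single_disk = 0.99                                     # assumed example value
mirrored_pair = parallel_availability(single_disk, 2)  # about 0.9999
three_in_series = series_availability([0.99, 0.99, 0.99])  # about 0.9703
```

Under these assumptions, chaining three 99% components lowers overall availability to roughly 97%, while mirroring a single 99% component raises it to roughly 99.99%. Real failures are often correlated (shared power, shared software bugs), so actual gains tend to be smaller.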

Replication

In order to create the same instance of software as when a fault occurred, replication is extremely important. This means running a copy of the same software version as the primary, operating just as the primary would. Function, tests, results: everything is identical to the primary system, so that the copy can eventually replace the primary if needed.
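One simple way to picture replication is a store that applies every write to both a primary and a replica before acknowledging it, so the replica can take over with identical state. This is a minimal in-memory sketch with hypothetical class names; real systems replicate over a network and must handle lag, ordering, and partial failures.

```python
class KeyValueStore:
    """A tiny in-memory store standing in for one running instance."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def apply(self, key, value):
        self.data[key] = value


class ReplicatedStore:
    """Keeps a replica in lockstep with the primary."""

    def __init__(self):
        self.primary = KeyValueStore()
        self.replica = KeyValueStore()

    def write(self, key, value):
        # Synchronous replication: both copies receive every write.
        self.primary.apply(key, value)
        self.replica.apply(key, value)

    def promote_replica(self):
        # The replica becomes the primary; a fresh replica is seeded from it.
        self.primary = self.replica
        self.replica = KeyValueStore(self.primary.data)


store = ReplicatedStore()
store.write("customer-1", "active")
store.write("customer-2", "pending")
state_before_fault = dict(store.primary.data)

# Simulate losing the primary: the replica takes over with the same state.
store.promote_replica()
state_after_promotion = dict(store.primary.data)
```

Because both copies saw the same writes in the same order, the promoted replica holds exactly the data the primary held when the fault occurred.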

Fault Tolerance with Lyrid

At Lyrid, we’re enthusiastic about saving your applications and services from harmful events (while saving you time and money in the process). In order to address fault tolerance, we turn to our usage of microservices.

Through the use of microservices on Lyrid, you’ll be able to compartmentalize the different apps and services that are running, meaning that if an application were to go down, the failure would be an isolated event and wouldn’t impact your other services. This practice also translates to how we use Kubernetes: if a node were to stop, the remaining nodes would take over the failed node’s load and balance the traffic.

In securing your applications and services, we also offer disaster recovery and offsite backup machines, with regularly updated data.

Want to learn more about how we protect your system? Book a demo here!
