Systems Design - Availability

What is Availability?

Availability is the probability that a server or service is up and running at any given point in time, usually expressed as a percentage. For example, 80% availability means that the server was up for 80% of the time.

$$ \text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} $$
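As a minimal sketch, the formula above can be computed directly from measured uptime and downtime. The function name `availability` and the choice of seconds as the unit are assumptions made for illustration.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_seconds / total


# A service observed for 10 hours that was down for 2 of them.
print(f"{availability(8 * 3600, 2 * 3600):.0%}")  # 80%
```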

From a customer's point of view, the ideal availability of any system is 100%. In practice, this is rarely achievable. So, let's consider what we would call a Highly Available system.

High Availability

A Highly Available (HA) server is a server that has a very high level of availability, typically 99.999% or greater. High Availability is generally expressed in terms of "nines". In this case, five nines or above is considered a highly available system.

99.999% ("five nines")

Downtime per year = 5.26 minutes
Downtime per month = 26.30 seconds
Downtime per day = 864.00 milliseconds
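The figures above can be reproduced with a few lines of arithmetic. This is a sketch that assumes an average year of 365.25 days and a month of one twelfth of a year; the helper name `downtime_budget` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def downtime_budget(nines: int) -> dict:
    """Allowed downtime in seconds per day, month, and year for N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    per_day = unavailability * SECONDS_PER_DAY
    per_year = per_day * DAYS_PER_YEAR
    return {"day": per_day, "month": per_year / 12, "year": per_year}


budget = downtime_budget(5)
print(f"per year:  {budget['year'] / 60:.2f} minutes")
print(f"per month: {budget['month']:.2f} seconds")
print(f"per day:   {budget['day'] * 1000:.2f} milliseconds")
```

Calling `downtime_budget(7)` gives the equivalent budget for seven nines.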

Similarly, 7 nines equates to 99.99999% availability of a system.

Check out uptime.is for a quick conversion from SLA to downtime.

SLA

Short for "service-level agreement", an SLA is a collection of guarantees given to a customer by a service provider. SLAs typically make guarantees about a system's availability, among other things. An SLA is made up of one or more SLOs, or "service-level objectives", and availability targets are generally stated as a number of nines.

Redundancy

Redundancy is the practice of replicating parts of a system to make it more reliable. It is usually used to eliminate a "Single Point Of Failure" (SPOF). For example, if you have a single server, you can add a second server as a redundancy, and then add a load balancer to distribute the load between the two servers. However, the load balancer is now itself a SPOF, so you would need to add another load balancer to make sure the system's availability is maintained.
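To get a rough feel for why replication helps, the sketch below estimates the availability of a group of identical replicas where any single healthy replica is enough to serve traffic. It assumes that replicas fail independently of each other, which real deployments only approximate; the function name `combined_availability` is hypothetical.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` identical components where any one suffices.

    Assumes independent failures: the group is down only when every
    replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1 - (1 - component_availability) ** replicas


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two replicas that are each 99% available reach roughly 99.99% together, which is why removing a SPOF has such a large effect.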

Passive Redundancy

When several entities of the system perform the same task, the system is passively redundant. In case of downtime in one entity, the other entities continue operating as-is, without any changes in the system except for a higher load on each of them. For example, twin engines on an aircraft.
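The "higher loads" effect described above can be illustrated with a toy model in which traffic is split evenly across whichever entities are still healthy. The class `Pool` and its methods are hypothetical names used only for this sketch.

```python
class Pool:
    """Toy model of passive redundancy: every healthy entity does the same work."""

    def __init__(self, names):
        self.healthy = set(names)

    def mark_down(self, name):
        # The remaining entities need no reconfiguration; they just absorb more load.
        self.healthy.discard(name)

    def load_per_entity(self, total_requests: float) -> float:
        if not self.healthy:
            raise RuntimeError("no healthy entities left")
        return total_requests / len(self.healthy)


pool = Pool(["a", "b", "c", "d"])
print(pool.load_per_entity(1000))  # 250.0
pool.mark_down("c")
print(round(pool.load_per_entity(1000), 1))  # 333.3
```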

Active Redundancy

Active Redundancy is found in complex systems where one or a few entities are responsible for operations. The entities in the system communicate with each other so that, when an active entity goes down, the inactive entities take over and resume operations without any hit to the system's availability.
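One common way to implement the communication described above is a heartbeat: the active entity reports in periodically, and a monitor promotes a standby when the reports stop. The following is a simplified sketch under that assumption; `Node`, `FailoverMonitor`, and the timeout value are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = None

    def beat(self, now):
        """Record that this node was alive at time `now` (in seconds)."""
        self.last_heartbeat = now


class FailoverMonitor:
    def __init__(self, primary, standbys, timeout_seconds):
        self.active = primary
        self.standbys = list(standbys)
        self.timeout = timeout_seconds

    def check(self, now):
        """Promote a standby if the active node has missed its heartbeat window."""
        last = self.active.last_heartbeat
        stale = last is None or now - last > self.timeout
        if stale and self.standbys:
            self.active = self.standbys.pop(0)
            # Treat promotion as a fresh heartbeat so the new node gets a full window.
            self.active.beat(now)
        return self.active


primary, standby = Node("primary"), Node("standby")
monitor = FailoverMonitor(primary, [standby], timeout_seconds=3.0)
primary.beat(now=0.0)
print(monitor.check(now=2.0).name)   # primary
print(monitor.check(now=10.0).name)  # standby
```

A real system would also need to handle the monitor itself failing, since otherwise the monitor becomes a new SPOF.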

Summary

If a system needs to be Highly Available, then eliminating all Single Points Of Failure by adding redundancies is of utmost importance. There also need to be processes in place to make sure that human intervention to fix availability issues happens within the right timeframe.

For more, check out the Wikipedia page for Highly Available Systems