As a SoftLayer sales engineer, I get the opportunity to talk to a wide range of customers on a daily basis about almost everything under the sun. This is one of my favorite parts of working at SoftLayer: every day is unique and the topics range from a standalone LAMP server to thousands of servers in a big data cluster—and everything in between. It can be challenging at times, due to the infinite number of solutions that SoftLayer can run, but it also gives me the chance to learn and teach others. In this blog post, I’ll discuss high availability (HA), disaster recovery (DR), global server load balancing (GSLB), and load balancing (LB), as I occasionally hear customers mix up the terms, and I think a little clarity on the topics could help.
Before we dive into the differences, let’s first define each term (I did take a stab at stating these in my own words, but Wikipedia does such a good job that I paraphrased its descriptions and added a little more context).
- High availability (HA): HA is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period. There are three principles of system design in high availability engineering: the elimination of single points of failure (SPOF), reliable failover, and failure detection.
- Disaster recovery (DR): DR involves a set of policies and procedures to enable the recovery or continuation of systems following a natural or human-induced disaster. Disaster recovery focuses on keeping all essential aspects of a business functioning despite significant disruptive events.
- Global server load balancing (GSLB): GSLB is a method of splitting traffic across multiple servers using DNS and geographical locations as the means to determine where request traffic will be sent.
- Load balancing (LB): LB is a way to distribute processing and communications evenly across multiple servers within a data center so that a single device does not carry an entire load. LB is essential in situations where it is difficult to predict the number of requests issued to a server, and it can distribute requests that would have been made to a single server to ease the load and minimize latency and other issues.
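To put a number on the "agreed level of operational performance" in the HA definition, here is a minimal arithmetic sketch. The function names are my own, and the redundancy formula assumes node failures are independent and failover is instant, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for a target such as 0.999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(single_node: float, nodes: int) -> float:
    """Combined availability of redundant nodes.

    Assumption: failures are independent and failover is instant and
    perfect, so the service is down only when every node is down.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 - (1.0 - single_node) ** nodes
```

Under these assumptions, a 99.9% target leaves about 525.6 minutes (roughly 8.8 hours) of downtime per year, and pairing two 99% nodes yields about 99.99%.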
Now that we've defined each of these topics, let’s quickly check off the main points of each:
HA
- No single points of failure (SPOF)
- Each component of a system has at least one failover node
- If a server is part of an HA pair, it is recommended to run the OS on at least a RAID 1 group and data partitions on a RAID 1, 5, 6, 10, or another redundant RAID group
- If the system is part of a cluster, it is still recommended to run the OS on at least a RAID 1 group, while data partitions can be optimized for storage capacity
- Redundant power
- Dual path networking/uplinks
- Use portable IP addresses for HA/service configurations; primary IPs assigned directly to a server or VLAN are specific to that instance and can lead to IP conflicts or unintended service disruptions
- Database systems are configured for HA or clustering at the application level
- Web/app systems are configured as an HA pair at the OS or application level, or are placed behind a load balancer
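The failure detection and failover points in the HA list can be sketched as a small simulation. This is an illustration only: `Node`, `HAPair`, and the three-miss threshold are hypothetical names and values of mine, not a SoftLayer API, and a real HA pair would also move the portable IP to the new primary.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class HAPair:
    primary: Node
    standby: Node
    failure_threshold: int = 3  # consecutive missed checks before failover
    misses: int = 0

    def heartbeat(self) -> Node:
        """Record one health check and return the node that should serve traffic."""
        if self.primary.healthy:
            self.misses = 0
            return self.primary
        self.misses += 1
        # Fail over only after repeated misses, and only to a healthy standby.
        if self.misses >= self.failure_threshold and self.standby.healthy:
            self.primary, self.standby = self.standby, self.primary
            self.misses = 0
        return self.primary
```

Requiring several consecutive misses is one common way to avoid failing over on a single dropped health check; the right threshold depends on how quickly the service needs to recover.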
DR
- Companies should analyze their infrastructure and personnel assignment to identify mission-critical system components and personnel
- A plan should be developed to identify and recover from a disaster; this plan should also include recovery time objective (RTO) and recovery point objective (RPO) to reflect the business model
- A secondary data center (DC) is recommended to mitigate risks of a major natural or human disaster
- Mission-critical systems should be on standby or quickly deployable to meet or beat a company’s stated RTO
- Backup data should be stored offsite and ideally at the secondary DR site to reduce recovery time
- Once a plan is in place, mock failovers should be performed regularly to ensure the DR plan is fully executable and all parties understand their roles
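One way to make the RTO and RPO points in the DR list concrete is to compare them against what the plan can actually deliver. The sketch below is a simplification with hypothetical names (`DRPlan`, `plan_gaps`): it treats the backup interval as the worst-case data loss and the duration of the last mock failover as the expected recovery time.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class DRPlan:
    rto: timedelta                # maximum acceptable time to recover
    rpo: timedelta                # maximum acceptable window of lost data
    backup_interval: timedelta    # time between backups sent offsite
    measured_recovery: timedelta  # duration of the most recent mock failover


def plan_gaps(plan: DRPlan) -> List[str]:
    """Return the ways in which the plan misses its stated objectives."""
    gaps = []
    # Assumption: data written since the last backup is lost in a disaster.
    if plan.backup_interval > plan.rpo:
        gaps.append("RPO at risk: backups run less often than the RPO allows")
    if plan.measured_recovery > plan.rto:
        gaps.append("RTO at risk: the last mock failover exceeded the RTO")
    return gaps
```

For example, a plan with a one-hour RPO and nightly backups would be flagged, even if its mock failovers comfortably meet the RTO.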
GSLB
- Complete, independent systems should be deployed into two or more DC locations
- Each location is accessible via one or more unique IP addresses
- Data systems should be designed to operate independently in each region, with synchronization on a schedule or on demand where needed
- Each location hosts at least one LB instance that supports GSLB
- Based on availability of each site, the location of a user, or data sovereignty regulations, users are directed to an available site via DNS resolution
- Once a user has been directed to a site, standard load balancing takes precedence until the time to live (TTL) of the DNS resolution expires
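The GSLB decision described above, picking a site by availability and user location and answering with that site's IP, can be sketched roughly as follows. `Site` and `resolve` are hypothetical names, the IPs come from the ranges reserved for documentation, and a real GSLB device would answer actual DNS queries with a TTL rather than return a string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Site:
    name: str
    region: str
    ip: str        # the unique IP address that fronts this location
    healthy: bool  # result of the most recent site-level health check


def resolve(sites: List[Site], user_region: str) -> Optional[str]:
    """Return the IP a DNS answer would carry for a user in user_region."""
    available = [site for site in sites if site.healthy]
    if not available:
        return None  # no site can serve traffic
    # Prefer a healthy site in the user's own region.
    for site in available:
        if site.region == user_region:
            return site.ip
    # Otherwise fall back to the first healthy site in the list.
    return available[0].ip
```

Data sovereignty rules would add a filter before the fallback, so that users are never directed to a region where their data may not be served.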
LB
- Each server within an LB pool should reside in the same DC as the LB, or performance may degrade and health checks may fail
- A minimum of two servers should be included in an LB pool
- Load should be spread across servers based on the specification of each server; if all servers are equal in specs, the load should be shared equally
- Each server in an LB pool will need a public IP address and active public interface to respond to Internet requests
- When possible, it is recommended to leverage LB features such as SSL offload to minimize load on web servers
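Spreading load according to each server's specification is often done with weights. Here is a minimal weighted round-robin sketch; `weighted_rotation` and the server names are hypothetical, and production load balancers typically use smoother schemes combined with health checks.

```python
from itertools import cycle
from typing import Dict, Iterator


def weighted_rotation(weights: Dict[str, int]) -> Iterator[str]:
    """Cycle through servers forever, each appearing in proportion to its weight."""
    if len(weights) < 2:
        raise ValueError("an LB pool should contain at least two servers")
    if any(weight < 1 for weight in weights.values()):
        raise ValueError("weights must be positive integers")
    # Repeat each server name once per unit of weight, then loop over the list.
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(schedule)
```

With equal weights this reduces to plain round robin; giving a server with twice the capacity a weight of 2 sends it twice as many requests.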
I hope this clarifies the terms and uses of HA, DR, GSLB, and LB. Without background, tech jargon can be a bit ambiguous. In this case, some of the terms even share part of an acronym, so it’s easy to mix them up. If you haven't had a chance to kick the tires of the SoftLayer LB offerings or if you’re looking to build a DR solution on SoftLayer, just let us know. We’ll be happy to dive in and help you out.