Posts Tagged 'Disaster Recovery'

June 30, 2016

HA, DR, GSLB, LB: The What’s What and Who’s Who of Uptime

As a SoftLayer sales engineer, I get the opportunity to talk to a wide range of customers on a daily basis about almost everything under the sun. This is one of my favorite parts of working at SoftLayer: every day is unique, and the topics range from a standalone LAMP server to thousands of servers in a big data cluster—and everything in between. It can be challenging at times, due to the sheer number of solutions that SoftLayer can run, but it also gives me the chance to learn and teach others. In this blog post, I'll discuss high availability (HA), disaster recovery (DR), global server load balancing (GSLB), and load balancing (LB), as I occasionally hear customers mix up the terms, and I think a little clarity on these topics could help.

Before we dive into the differences, let’s define each in alphabetical order (I did take a stab at stating this in my own words, but Wikipedia does such a good job that I paraphrased from its descriptions and added in a little more context).

  • High availability (HA): HA is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. There are three principles of system design in high availability engineering: the elimination of single points of failure (SPOF), reliable failover, and failure detection.
  • Disaster recovery (DR): DR involves a set of policies and procedures to enable the recovery or continuation of systems following a natural or human-induced disaster. Disaster recovery focuses on keeping all essential aspects of a business functioning despite significant disruptive events.
  • Global server load balancing (GSLB): GSLB is a method of splitting traffic across multiple servers using DNS and geographical locations as the means to determine where request traffic will be sent.
  • Load balancing (LB): LB is a way to distribute processing and communications evenly across multiple servers within a data center so that a single device does not carry an entire load. LB is essential in situations where it is difficult to predict the number of requests issued to a server, and it can distribute requests that would have been made to a single server to ease the load and minimize latency and other issues.

Now that we've defined each term, let's quickly check off the main points of each:

HA

  • No single points of failure (SPOF)
  • Each component of a system has at least one failover node

Hardware Recommendations

  • If a server is part of an HA pair, it is recommended to run the OS on at least a RAID 1 group and DATA partitions on a RAID 1, 5, 6, 10, or higher group
  • If the system is part of a cluster, it is always recommended to run the OS on at least a RAID 1 group; DATA partitions can be optimized for storage capacity (the capacity trade-off is sketched after this list)
  • Redundant power
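
To make that trade-off concrete, here is a minimal Python sketch of rough usable capacity at common RAID levels. The disk counts and sizes are hypothetical, and real arrays (hot spares, controller overhead) will differ:

    def usable_tb(level, n_disks, size_tb):
        """Rough usable capacity for common RAID levels (no hot spares)."""
        if level == 1:        # n-way mirror: capacity of a single disk
            return size_tb
        if level == 5:        # single parity: n - 1 disks hold data
            return (n_disks - 1) * size_tb
        if level == 6:        # double parity: n - 2 disks hold data
            return (n_disks - 2) * size_tb
        if level == 10:       # striped mirrors: half the raw capacity
            return n_disks * size_tb / 2
        raise ValueError("unsupported RAID level")

    # Hypothetical example: four 2 TB disks (two for the RAID 1 mirror)
    print(f"RAID 1:  {usable_tb(1, 2, 2)} TB usable")
    for lvl in (5, 6, 10):
        print(f"RAID {lvl}: {usable_tb(lvl, 4, 2)} TB usable")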

Network Recommendations

  • Dual path networking/uplinks
  • Utilize portable IP addresses for HA/service configurations, as primary IPs assigned directly to a server or VLAN are specific to that instance and can lead to IP conflicts or unintended disruption in service
  • Database systems are configured at the application level for HA or clustering
  • Web/app systems are configured at the OS or application level in an HA pair, or are placed behind a load balancer (a failure-detection sketch follows this list)
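
Failure detection is what ties these recommendations together. Below is a minimal Python sketch, assuming hypothetical node addresses; in a real deployment, promoting the standby would mean moving the portable IP via your provider's API rather than printing a message:

    import socket
    import time

    ACTIVE = ("10.0.1.10", 80)    # hypothetical active node
    STANDBY = ("10.0.1.11", 80)   # hypothetical standby node
    FAILURE_THRESHOLD = 3         # consecutive failed probes before failover

    def is_healthy(addr, timeout=2.0):
        """Return True if a TCP connection to addr succeeds within timeout."""
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def monitor():
        failures = 0
        while True:
            failures = 0 if is_healthy(ACTIVE) else failures + 1
            if failures >= FAILURE_THRESHOLD:
                # Failover: reassign the portable IP to the standby node here
                print("Active node down; promoting standby", STANDBY)
                break
            time.sleep(5)  # probe interval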

DR

  • Companies should analyze their infrastructure and staffing to identify mission-critical systems and personnel
  • A plan should be developed to identify and recover from a disaster; this plan should also define a recovery time objective (RTO) and recovery point objective (RPO) that reflect the business model (see the RPO sketch after this list)
  • A secondary data center is recommended to mitigate risks of a major natural or human disaster
  • Mission-critical systems should be on standby or quickly deployable to meet or beat a company’s stated RTO
  • Backup data should be stored offsite and ideally at the secondary DR site to reduce recovery time
  • Once a plan is in place, mock fail-overs should be performed regularly to ensure the DR plan is fully executable and all parties understand their roles
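
To illustrate how the RPO interacts with the offsite backup recommendation above, here is a minimal Python sketch with hypothetical timestamps. The RPO bounds how much data the business can afford to lose, so a newest backup older than the RPO means the plan is out of compliance before any disaster strikes:

    from datetime import datetime, timedelta

    RPO = timedelta(hours=4)  # hypothetical business requirement

    # Hypothetical timestamps for the newest offsite backup and "now"
    last_offsite_backup = datetime(2016, 6, 27, 9, 0)
    now = datetime(2016, 6, 27, 14, 30)

    exposure = now - last_offsite_backup  # worst-case data loss window
    if exposure > RPO:
        print(f"RPO violated: up to {exposure} of data at risk (limit: {RPO})")
    else:
        print(f"Within RPO: worst-case data loss is {exposure}")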

GSLB

  • Complete, independent systems should be deployed into two or more DC locations
  • Each location is accessible via its own unique IP address(es)
  • Data systems should be designed to operate independently in each region, with synchronization on a schedule or on demand where needed
  • Each location hosts at least one LB instance that supports GSLB
  • Based on the availability of each site, the location of a user, or data sovereignty regulations, users are directed to an available site via DNS resolution
  • Once a user has been directed to a site, standard load balancing takes over until the time to live (TTL) of the DNS resolution expires (see the sketch after this list)
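
Here is a minimal Python sketch of the decision a GSLB-capable DNS service makes, using hypothetical site records: given the requester's region, it returns the IP of a healthy site (preferring the local one) plus a TTL, and ordinary load balancing takes over at that site until the TTL expires:

    SITES = [  # hypothetical deployments in two DC locations
        {"name": "dal09", "region": "us", "ip": "198.51.100.10", "healthy": True},
        {"name": "ams01", "region": "eu", "ip": "203.0.113.20", "healthy": True},
    ]
    TTL = 60  # seconds; controls how long clients stick to a site

    def resolve(user_region):
        """Prefer a healthy site in the user's region, else any healthy site."""
        healthy = [s for s in SITES if s["healthy"]]
        if not healthy:
            raise RuntimeError("no healthy sites available")
        local = [s for s in healthy if s["region"] == user_region]
        site = (local or healthy)[0]
        return site["ip"], TTL

    print(resolve("eu"))  # -> ('203.0.113.20', 60)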

LB

  • Each server within an LB pool should reside in the same DC as the LB, or performance may degrade and health checks may fail
  • A minimum of two servers should be included in an LB pool
  • Load should be spread across servers based on the specification of each server; if all servers are equal in specs, the load should be shared equally (a weighted distribution sketch follows this list)
  • Each server in an LB pool will need a public IP address and an active public interface to respond to Internet requests
  • When possible, it is recommended to leverage LB features such as SSL offload to minimize load on web servers
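
As a minimal illustration of spec-based weighting, here is a Python sketch with hypothetical servers and weights: a server with twice the capacity gets weight 2 and therefore twice the requests. Production load balancers interleave more smoothly, but the proportions work out the same:

    import itertools

    POOL = {"web01": 2, "web02": 1, "web03": 1}  # hypothetical server: weight

    def weighted_cycle(pool):
        """Yield servers forever, in proportion to their weights."""
        expanded = [name for name, weight in pool.items() for _ in range(weight)]
        return itertools.cycle(expanded)

    servers = weighted_cycle(POOL)
    for _ in range(8):
        print(next(servers))  # web01 appears twice as often as the others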

I hope this clarifies the terms and uses of HA, DR, GSLB, and LB. Without background, tech jargon can be a bit ambiguous. In this case, some of the terms even share some of the same acronyms, so it’s easy to mix them up. If you haven't had a chance to kick the tires of the SoftLayer LB offerings or if you’re looking to build a DR solution on SoftLayer, just let us know. We’ll be happy to dive in and help you out.

- JD

 

June 27, 2016

Disaster Recovery in the Cloud: Are You Prepared?

While the importance of choosing the right disaster recovery solution and cloud provider cannot be overstated, having a disaster recovery runbook is equally important (if not more so). I have been involved in multiple conversations where the customer's primary focus was the implementation of the best-suited disaster recovery technology, but the conversation about the DR runbook was either missing completely or lacked key pieces of information. Today, my focus will be to lay out a framework for what your DR runbook should look like.

“Eighty percent of businesses affected by a major incident either never re-open or close within 18 months.” (Source: Axa Report)

What is a disaster recovery runbook?

A disaster recovery runbook is a working document that outlines a recovery plan with all the necessary information required for execution of this plan. This document is unique to every organization and can include processes, technical details, personnel information, and other key pieces of information that may not be readily available during a disaster situation.

What should I include in this document?

As previously stated, a runbook is unique to every organization depending on the industry and internal processes, but there is standard information that applies to all organizations and should be included in every runbook. Below is a list of the most important information:

  • Version control and change history of the document.
  • Contacts with titles, phone numbers, email addresses, and job responsibilities.
  • Service provider and vendor list with point of contact, phone numbers, and email addresses.
  • Access Control List: application/system access and physical access to offices/data centers.
  • Updated organization chart.
  • Use case scenarios based on DR testing, i.e., what to do in the event of X, and the chain of events that must take place for recovery.
  • Alert and custom notifications/emails that need to be sent for a failure or DR event.
  • Escalation procedures.
  • Technical details and explanation of the disaster recovery solution (network layouts, traffic flows, systems and application inventory, backup configurations, etc.).
  • Application-based personnel roles and responsibilities.
  • Failover and failback procedures, including how to revert to normal operations (a skeleton runbook sketch follows this list).
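
To show how these pieces fit together, here is a minimal Python sketch of a machine-readable runbook skeleton; every name, contact, and number is hypothetical. Keeping the runbook as structured data also makes the version control and change history called for above much easier:

    RUNBOOK = {
        "version": "1.3",
        "last_updated": "2016-06-27",
        "contacts": [
            {"name": "Jane Doe", "title": "DR Lead",
             "phone": "+1-555-0100", "email": "jane@example.com"},
        ],
        "vendors": [
            {"name": "Cloud Provider", "contact": "support@example.com"},
        ],
        "scenarios": {
            "primary_dc_loss": [
                "Declare a DR event and notify all contacts",
                "Promote the standby database at the DR site",
                "Repoint DNS/GSLB to the DR site",
                "Validate applications, then notify stakeholders",
            ],
        },
        "rto_hours": 4,  # hypothetical recovery time objective
        "rpo_hours": 1,  # hypothetical recovery point objective
    }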

How to manage and execute the runbook

Processes, applications, systems, and employees can all change on a daily basis. It is essential to update the DR runbook regularly so that it stays accurate.

All relevant employees should receive DR training and should be well informed of their roles and responsibilities in a DR event. They should be asked to take ownership of certain tasks, which should be well documented in the runbook.

In short, we all hope to avoid a disaster. But when it happens, we must be prepared to tackle it. I hope the information above will be helpful in taking the first step towards preparing a DR runbook. Please feel free to contact me for additional information or guidance.

-Zeb

 
