Posts Tagged 'Uptime'

June 30, 2016

HA, DR, GSLB, LB: The What’s What and Who’s Who of Uptime

As a SoftLayer sales engineer, I get the opportunity to talk to a wide range of customers on a daily basis about almost everything under the sun. This is one of my favorite parts of working at SoftLayer: every day is unique and the topics range from a standalone LAMP server to thousands of servers in a big data cluster, and everything in between. It can be challenging at times, due to the infinite number of solutions that SoftLayer can run, but it also gives me the chance to learn and teach others. In this blog post, I’ll discuss high availability (HA), disaster recovery (DR), global server load balancing (GSLB), and load balancing (LB), as I occasionally hear customers mix up the terms, and I think a little clarity on the topics could help.

Before we dive into the differences, let’s define each term (I did take a stab at stating this in my own words, but Wikipedia does such a good job that I paraphrased from its descriptions and added in a little more context).

  • High availability (HA): HA is a characteristic of a system, which aims to ensure an agreed level of operational performance for a higher than normal period. There are three principles of system design in high availability engineering: the elimination of single points of failure (SPOF), reliable failover, and failure detection.
  • Disaster recovery (DR): DR involves a set of policies and procedures to enable the recovery or continuation of systems following a natural or human-induced disaster. Disaster recovery focuses on keeping all essential aspects of a business functioning despite significant disruptive events.
  • Global server load balancing (GSLB): GSLB is a method of splitting traffic across multiple servers using DNS and geographical locations as the means to determine where request traffic will be sent.
  • Load balancing (LB): LB is a way to distribute processing and communications evenly across multiple servers within a data center so that a single device does not carry an entire load. LB is essential in situations where it is difficult to predict the number of requests issued to a server, and it can distribute requests that would have been made to a single server to ease the load and minimize latency and other issues.

Now that we've defined each of these topics, let’s quickly check off the main points of each topic:


High Availability (HA)

  • No single points of failure (SPOF)
  • Each component of a system has at least one failover node

Hardware Recommendations

  • If a server is part of an HA pair, it is recommended to run the OS on at least a RAID 1 group and DATA partitions on a RAID 1, 5, 6, 10, or higher group
  • If the system is part of a cluster, it is always recommended to run the OS on at least a RAID 1 and DATA partitions can be optimized for storage capacity 
  • Redundant power

Network Recommendations

  • Dual path networking/uplinks
  • Utilize portable IP addresses for HA/service configurations, because primary IPs assigned directly to a server or VLAN are specific to that instance and can lead to IP conflicts or unintended disruption in service
  • Database systems are configured at the application for HA or clustering
  • Web/app systems are configured at the OS or app in an HA pair or are placed behind a load balancer
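
To put rough numbers on why a failover node for every component matters, here is a minimal Python sketch. It assumes node failures are independent and that failover is instant and reliable, which real systems only approximate. The function names and the 99% figure are illustrative assumptions, not SoftLayer metrics.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a component that stays up while at least one node is up.

    Assumes independent failures and perfect failover (a simplification).
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The component is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # Compare a standalone node with an HA pair built from the same nodes.
    for n in (1, 2):
        a = combined_availability(0.99, n)
        print(f"{n} node(s): {a:.4%} available, "
              f"{downtime_hours_per_year(a):.1f} h/year down")
```

Under these assumptions, pairing two 99% nodes takes the component from 99% to 99.99% availability. Correlated failures (shared power, shared uplinks) erode that gain, which is why the hardware and network recommendations above call for redundancy at those layers too.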


Disaster Recovery (DR)

  • Companies should analyze their infrastructure and personnel assignment to identify mission-critical system components and personnel
  • A plan should be developed to identify and recover from a disaster; this plan should also include recovery time objective (RTO) and recovery point objective (RPO) to reflect the business model
  • A secondary data center is recommended to mitigate risks of a major natural or human disaster
  • Mission-critical systems should be on standby or quickly deployable to meet or beat a company’s stated RTO
  • Backup data should be stored offsite and ideally at the secondary DR site to reduce recovery time
  • Once a plan is in place, mock fail-overs should be performed regularly to ensure the DR plan is fully executable and all parties understand their roles
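
As a rough illustration of how RTO and RPO turn into something you can check after each mock failover, here is a minimal Python sketch. The `RecoveryPlan` fields and the `plan_gaps` helper are hypothetical names invented for this example, and the worst-case data loss is assumed to equal one full backup interval.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical summary of a DR plan for one system."""
    rto: timedelta                # maximum tolerable time to restore service
    rpo: timedelta                # maximum tolerable window of lost data
    backup_interval: timedelta    # time between offsite backups
    measured_recovery: timedelta  # duration observed in the last mock failover


def plan_gaps(plan: RecoveryPlan) -> list[str]:
    """Return human-readable gaps; an empty list means both targets are met."""
    gaps = []
    # Worst case, a disaster strikes just before the next backup runs,
    # so up to one full backup interval of data can be lost.
    if plan.backup_interval > plan.rpo:
        gaps.append("backup interval exceeds RPO")
    if plan.measured_recovery > plan.rto:
        gaps.append("measured recovery time exceeds RTO")
    return gaps
```

For example, a plan with a one-hour RPO but nightly backups would report "backup interval exceeds RPO" even if the last mock failover finished well inside the RTO.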


Global Server Load Balancing (GSLB)

  • Complete, independent systems should be deployed into two or more DC locations
  • Each location is accessible via one or more unique IP addresses
  • Data systems should be designed to operate independently in each region and possibly be synchronized on a schedule or on demand
  • Each location hosts at least one LB instance that supports GSLB
  • Based on availability of each site, the location of a user, or data sovereignty regulations, users are directed to an available site via DNS resolution
  • Once a user has been directed to a site, standard load balancing takes precedence until the time to live (TTL) of the DNS resolution expires
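
The site-selection logic described in the points above can be sketched in a few lines of Python. This is only an illustration of the decision a GSLB-aware DNS answer encodes, not how any particular load balancer implements it. The `Site` fields, the region labels, and the `resolve` function are assumptions made for the example, and the IP addresses come from the ranges reserved for documentation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str      # hypothetical data center label
    region: str    # coarse geographic region of the site
    vip: str       # unique IP address that fronts the site's load balancer
    healthy: bool  # result of the GSLB health check


def resolve(sites: list[Site], user_region: str,
            allowed_regions: set[str] | None = None) -> str | None:
    """Pick the IP address a GSLB-aware DNS answer would return.

    Only healthy sites permitted for this user are considered. A site in the
    user's own region is preferred, then any other permitted healthy site.
    """
    candidates = [
        s for s in sites
        if s.healthy and (allowed_regions is None or s.region in allowed_regions)
    ]
    if not candidates:
        return None  # no site can serve this user
    # list.sort() is stable, so ties keep the order the sites were listed in.
    candidates.sort(key=lambda s: s.region != user_region)
    return candidates[0].vip


if __name__ == "__main__":
    sites = [
        Site("site-a", "us", "192.0.2.10", healthy=True),
        Site("site-b", "eu", "198.51.100.10", healthy=True),
    ]
    print(resolve(sites, user_region="eu"))
```

Passing `allowed_regions` models a data sovereignty rule: a user restricted to one region gets no answer at all rather than being sent somewhere the data is not allowed to go. In a real deployment the chosen address is handed back with a TTL, and the user stays on that site until the TTL expires.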


Load Balancing (LB)

  • Each server within an LB pool should reside in the same DC as the LB, or performance may degrade and health checks may fail
  • A minimum of two servers should be included in an LB pool
  • Load should be spread across servers based on the specification of each server; if all servers are equal in specs, the load should be shared equally
  • Each server in an LB pool will need a public IP address and active public interface to respond to Internet requests
  • When possible, it is recommended to leverage LB features such as SSL offload to minimize load on web servers
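
To show what spreading load "based on the specification of each server" can look like, here is a minimal Python sketch of smooth weighted round-robin, one common scheduling approach. It is an assumption chosen for illustration, not necessarily what a SoftLayer load balancer does internally. The server names and weights are hypothetical.

```python
from __future__ import annotations


def weighted_round_robin(weights: dict[str, int], requests: int) -> list[str]:
    """Assign `requests` requests to servers in proportion to their weights.

    Implements smooth weighted round-robin: every turn each server gains its
    weight, the server with the highest running score is picked, and the pick
    is then charged the total weight so heavier servers are not hit in bursts.
    """
    if len(weights) < 2:
        raise ValueError("an LB pool should contain at least two servers")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive integers")

    total = sum(weights.values())
    score = {name: 0 for name in weights}
    order = []
    for _ in range(requests):
        for name, weight in weights.items():
            score[name] += weight
        chosen = max(score, key=score.get)  # first server wins a tie
        score[chosen] -= total
        order.append(chosen)
    return order


if __name__ == "__main__":
    # A server with twice the capacity gets twice the requests.
    print(weighted_round_robin({"big": 2, "small": 1}, requests=6))
```

With equal weights this reduces to plain round-robin, which matches the rule that identically specced servers share the load equally.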

I hope this clarifies the terms and uses of HA, DR, GSLB, and LB. Without background, tech jargon can be a bit ambiguous. In this case, some of the acronyms even share letters, so it’s easy to mix them up. If you haven't had a chance to kick the tires of the SoftLayer LB offerings or if you’re looking to build a DR solution on SoftLayer, just let us know. We’ll be happy to dive in and help you out.

- JD


January 11, 2011

Jurassic Park, Uptime, And You!

Some of you may remember the scene in the movie Jurassic Park where the park founder's granddaughter Lex, played by Ariana Richards, sits down at a computer terminal, gasps, and says "This is Unix. I know this!" That particular film moment has always resonated with me as a victory for realistic depiction of computer systems - the interface used in the movie is called fsn and was an actual Unix file manager - in an industry rife with horrific exaggerations; Swordfish, anyone? I'm sure there's an unwritten story as to how she (or her brother if you follow the book) gained her skills at a computer system that in 1993 was almost exclusively relegated to universities.

However, I digress. Shortly before that scene was another scene and catchphrase that should resound with familiarity to system administrators around the world. In the face of marauding dinosaurs and computer sabotage, the character John Arnold, played by Samuel L. Jackson, must sacrifice what I'm sure was an absurd amount of uptime by killing the power and rebooting the mainframe. Would the system come back up? Would everything load up as needed to get the park's systems back online? John's mantra was simple: "Hold on to your butts!"

Every day as a Systems Administrator I'm faced with a comparable (though far less exhilarating) situation. Linux is an extremely stable operating system, and I have logged into systems that have been online for quite literally years. Eventually, though, kernel updates or stray mounts necessitate a reboot. Will the server's filesystems need a check on reboot? Will the server even come back up? When a server's been online for that long, the only way to know is to "throw the switch" and cross your fingers.

One way to have a better idea of how your system will behave during reboots in a production environment is to take the time to update your kernel once a month or so and perform a reboot to make sure the update sticks. This allows routine file system checks to take place as necessary and keeps your system abreast of the latest kernel updates. It also familiarizes you with how long the process takes and what sort of caveats you may run into, and it reduces the overall attack surface your server exposes to outside attackers.

In the last year, I have seen at least two exploits that can give an attacker root access to a server running an outdated kernel using common toolkits that can attack commonly deployed Content Management Systems with trivial effort. Compromising an unprivileged user account gives an attacker even more leverage against unpatched systems. Google CVE-2009-2695 and CVE-2010-3081 if you don't believe me. If you run a production system or even a backend system that is exposed to the big, bad Internet, it is absolutely essential to make sure that your kernel, software, and security measures are up to date. Today's Slashdot article is tomorrow's exploit.

What lesson can we learn from the unfortunate folks at Jurassic Park? Don't assume your server is safe and don't wait until there are velociraptors roaming your halls looking for a snack to perform proper maintenance on your system.

-Autumn
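
If you want a nudge toward the monthly update-and-reboot habit described in this post, a small script can flag servers whose uptime has drifted past the window. The Python sketch below assumes a Linux host, where the first field of /proc/uptime is the number of seconds since boot. The function names and the 30-day default are my own choices for illustration, so adjust them to your maintenance policy.

```python
def uptime_days(proc_uptime_text: str) -> float:
    """Parse the contents of Linux's /proc/uptime and return uptime in days.

    The file holds two numbers; the first is seconds since boot.
    """
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400


def reboot_overdue(proc_uptime_text: str, max_days: float = 30.0) -> bool:
    """True when the host has been up longer than the maintenance window."""
    return uptime_days(proc_uptime_text) > max_days


def read_proc_uptime(path: str = "/proc/uptime") -> str:
    """Read the raw uptime text; only meaningful on a Linux host."""
    with open(path) as handle:
        return handle.read()
```

On a Linux box, `reboot_overdue(read_proc_uptime())` tells you whether it is time to schedule that reboot. It says nothing about whether a newer kernel is actually installed, so treat it as a reminder rather than a patch audit.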

October 12, 2010

What Does it Cost (Part 1)

The Overview
I normally like to have a little fun in the blogs that I write and maybe even take the occasional jab at our CFO Mike Jones (all kidding aside about pink shirts and whatnot, he is a really great guy). This blog is intended to have more of an educational goal, and since there is a lot to take into consideration, I won’t be able to make any pink shirt cracks. The reason is that I’ve had a lot of conversations over the past year or two in which the question that always comes up is “How does SoftLayer compare to colocation and what is the better move for me?” We’ll look into this further throughout the blog series.

I was fortunate enough to be invited to attend the Network World IT Roadmaps events in both New York and Atlanta earlier this year. What motivated me to put fingers to keyboard here is the perspective I gained from the many people I talked to during and after the conference. I consider myself fortunate to have attended because it is rare that SLales staff are able to join in on the marketing campaign and work with people on a more face-to-face basis. Normally, SoftLayer Sales members cannot really help our customers if we are not at our desks to take their calls, chats, emails, or tickets. I enjoy attending events like these because you can learn so much more speaking with someone face to face than over a phone call or email.

Since this was not my first go-around with the Network World events, I was more familiar with the setup and was able to take more in from the people speaking at the event. There are some common themes that can affect business from the technology side of things, and if you want growth, you must invest in your own infrastructure and your own technology. If you are a small mom-and-pop shop that is fine with maintaining the status quo, it may not be as vital for you, but then again, you wouldn’t be reading this blog post now, would you? The themes I saw (broken down into simpler terms) were based around some basic principles.

  • A company is a grouping of people working for a common goal. Your people are your most valuable asset and it is important to put them in positions where they can be successful and ultimately you will be successful as well.
  • The Wayne Gretzky quotes “A good hockey player plays where the puck is. A great hockey player plays where the puck is going to be” and “I skate to where the puck is going to be, not where it has been” share a common-sense idea: if you are not looking to the future and figuring out what is coming next, you will always be trying to catch up. If you are not innovating or growing, then ultimately you are dying.
  • How can I get more? We are constantly pressured to do more with less, or at least get more out of what we already have. This is probably the biggest and most frequent question we all get no matter what our business model is and what we try to achieve.

There are, of course, many other themes than the ones I have just listed, and more specific ones too. Even though I certainly took much more away, these were some of the main takeaways that brought me back to an always-evolving answer to the same question that every speaker seemed to dance around: “What does it cost?”

No matter how big you are or how much budget you have in place, there will always be different options presented to you on how to build up your infrastructure. I have no doubt that you have asked yourself what it will cost in relation to many things, and possibly in many different ways. Making comparisons to figure out the cost and what will give the best possible results is the end goal we are trying to reach. But how can we get there? It can be very difficult to compare data centers to each other in an apples-to-apples fashion. There are simply too many variables to account for. My goal is to try and help us all tackle this broad issue, and hopefully it will lead to more discussion about pros and cons so that it can be easier to determine the best course of action in future planning.

There are a lot of things to consider in the cost of running a data center. It seems like a never-ending list of essential things that cost both money and time (which in some cases can be more valuable). In this series of blogs we’ll break specific parts of a data center down into the basics of several areas that you’d need to consider. Once we get into the basics we’ll want to look back and ask “what does it take to run a data center?” Most often people only look at the most tangible items with the easiest metrics to apply, which essentially comes down to server hardware, power, space, and bandwidth. Sometimes these are the only things that people look at in making this decision.

Depending on who you are and what you want to get out of your data center, this could be close to what you’d need to consider, but for 99% of the population who has any business with a data center this only covers the basics. As a society, convenience plays an ever-increasing role in what we look for, and on top of the basics, this 99% looking for data center infrastructure craves things like uptime, speed, reliability, and space/opportunity for scalability and expansion. Each of these things is more than just a desire; it is a verified need.

So in getting to the meat of what this blog is about, I’ll quickly discuss the different things that add to the total cost beyond the obvious things of Hardware, Space, Power, and Bandwidth. I know this is already pretty long for a blog, so I am turning this into a short series and I will follow up with additional blogs to go into more depth about each portion and how they can relate to each other. I will work to add insight from other customers who have asked this of themselves before, in addition to giving my own experiences on this topic.

Opportunity Costs
I consider the idea of Opportunity Costs to be amongst the highest and least quantifiable aspects of running a data center. This isn’t something that will have its own blog post because of its broad nature, so instead I’ll simply tie the idea of Opportunity Cost into each of the other blogs and show how it relates to the overall discussion.

There is often a simple truth in knowing or stating that if we choose option “A” it will negate the value, relevance, and in many cases the existence of any other previously viable options. Nearly all Opportunity Costs relate back to What Does it Cost by determining what could potentially be gained or lost with that decision. This idea can be further broken down into risk vs. reward, and a simple business decision: if you wish to take on less risk, you’ll need to pay more for it or get less in return. The same can be said for intangibles other than risk, like convenience, reliability, and speed.

Human Resource costs
Earlier, I mentioned that one of the main topics of discussion that guest speakers emphasized was that our people are our biggest assets, but at the same time they can also easily be one of our biggest costs. I think that a lot of businesses can agree with this statement; however, decisions about how we develop our infrastructure do not often take our people and their associated costs into account. Every business should have a growth model, but the cost of growth (or your growing pains) is often overlooked in the planning stages. We’ll look at specific situations and take into account the number of people needed to run everything yourself and what that will wind up costing from just the HR standpoint.

This gets into the cost of adding one more qualified employee. This is one of the biggest aspects often overlooked, because it involves not only the new people you would need to hire, but also how the work can monopolize time and production you would otherwise get from people you already have on staff.

The value of "On-Demand" and the cost of not having it.
Have you ever heard the phrase “time is money”? What does this mean to you? What can this mean in a data center? Here we’ll focus the conversation on efficiency and compare certain costs and benefits between different ways of achieving our goals.

We can take a look at standard processes that we may have to go through if we wish to add capacity or integrate new solutions with existing ones. Time has a huge value in today’s business world, and we’ll determine how having on-demand infrastructure can positively impact the bottom line. Having the necessary tools in a truly on-demand and versatile environment will be a major point of focus in everything moving forward, and it is an important intangible factor that we should not lose sight of.

Cost of Uptime/Redundancy
Uptime is one of the most common themes near the top of everyone’s list for data center management. We can all agree that uptime is important, but how important is it to each of us individually? We will look at scenarios in which a catastrophic event happens and ask ourselves what it would cost, not only in monetary terms but also over the long term and on a strategic level.

Downtime will eventually happen in all things, but if you can plan around this to have redundancy or failover then you can alleviate this risk. So we must again ask ourselves “what will this cost?” Simply put, redundancy can and will be expensive. Generally it will cost much more than just the sum of its parts, and it is easy to overlook certain areas where you may have a “single point of failure”. At the same time, we should consider what the cost will be for each additional level of redundancy that we incorporate.

In this blog we will focus heavily on two main ideas: the value of time in making long-term decisions and Opportunity Cost. We’ll look at what long-term commitments really cost in terms of scalability, large capital costs, and accounting for physical resources, along with their benefits and limitations. Once we have this established we can also more easily determine how this can affect your decision making and your ongoing ability to do the right thing for your business.

Different accounting practices can make a great difference in your bottom line. Carrying additional debt, paying taxes, and taking depreciation can bring a lot of costs that go beyond the normal operating costs. For this section I’ll enlist the help of some of our experts who have already run several scenarios and may be a bit more qualified than I am to speak on such matters.

In the end, this study can make it easier to compare and see if SoftLayer is the right solution for you or someone you may know. I can say that SoftLayer will not be the entire solution for some companies compared to doing things yourself; however, we do make sound business sense in about 95% of cases at some capacity, if not full capacity.

