The concept Single Point of Failure refers to the fact that somewhere between your clients and your servers there is a single point that if it fails downtime happens. The SPoF can be the server, the network, or the power grid. The dragon Single Point of Failure is always going to be there stalking you; the idea is to push SPoF far enough out to where you have done the best you can with your ability and budget.
At the server level you could combat SPoF by using redundant power supplies and disks. You can also have redundant servers fronted by a load balancer. One of the benefits when using load balancer technology is that the traffic for an application is spread between multiple app servers. You have the ability to take an app server out of rotation for upgrades and maintenance. When you’re done you bring the server back online, the load balancer notices it UP on the next check and the server is back in service.
Using a NetScaler VPX you can even have two groups of servers—one group which generally answer your queries and another group which usually does something else—with the second group functioning as a backup against all of the primary servers for a service having to be taken down through the Backup Virtual Server function.
Result: no Single Point of Failure for the app servers.
What happens if you are load balancing and have to take the load balancer out of service for upgrades or maintenance? Right, now we’ve moved SPoF up a level. One way to handle this is by using the NetScaler VPX product we have at SoftLayer. A pair of VPX instances (NodeA/NodeB) can be teamed in a failover cluster so that if the primary VPX is taken down (either by human action or because the hardware failed) the secondary VPX will begin answering for the IPs within a few seconds and processing the actions. When you bring NodeA back online it slips into the role of secondary until such time as NodeB fails or is taken down. I will note here that VPX instances do have dependency on certain network resources and that dependency can take both VPX instances down.
Result: Loss of a single VPX is not a Single Point of Failure.
So what’s next? A wide-ranging power failure or general network failure of either the frontend or the backend network could render both of the NetScalers in a city unusable or even the entire facility unusable. This can be worked around by having resources in two cities which are able to process queries for your users and by using the Global Load Balancer product we offer. GLB load balances between the cities using DNS results. A power failure taking down Seattle just means your queries go to Dallas instead. Why not skip the VPX layer and just GLB to the app servers? You could, if you don’t have a need for the other functionalities from the VPX.
Result: no single point of failure at the datacenter level
Having redundant functionality between cities takes planning, it takes work, and it takes funding. You have to consider synchronization of content. The web content is easy. Run something like an rsync from time to time. Synching the database content between machines or across cities is a bit more complicated. I’ve seen some customers use the built-in replication capabilities of their database software while others will do a home-grown process such as having their application servers write to multiple database servers. You also have to consider issues of state for your application. Can your application handle bouncing between cities?
Redundancy planning is not always fun but it is required for serious businesses, even if the answer is ultimately to not do any redundancy. People, hardware and processes will fail. Whether a failure event is a nightmare or just an annoyance depends on your preparation.