Posts Tagged 'Failure'

February 26, 2010

Hero or Failure?

You’re hired, welcome to the company! All you techies out there have heard that before. Then for the first couple of weeks you get the luxury of, “just take a look around the network and see what you see, make a note of what is good and what needs some work”. You make a few notes during your two-week honeymoon period and then you hit the ground running. You make changes to a few of the server configs to speed them up, and you notice that a couple of hard drives in the server farm are showing signs that they are about to fail, so you make a note to get that fixed. Everyone on the team hails your progress, smarts, and work ethic and thinks they have made the right choice. Even though the in-house gear is a little old, you have made changes that made things faster and more redundant in your first month. Great job! You are on your way to the Information Systems Hero title.

Everything is going along great at about the eight-month point. You have made a few key decisions along the way and have some of your gear outsourced now. All the ancient hardware onsite has been retired and liquidated, and just a few core machines remain. You still have a large storage device and a tape robot onsite for your backups, and you keep the tape library safely offsite. All is good in the department.

If you want to be the Hero skip to the word HERO / If you want to be a failure please skip 2 paragraphs to the word FAILURE


HERO

You have a free day or two in which nothing pressing needs to be addressed, and you decide to look into the backup rotation and type. After spending a little time looking at it and not feeling comfortable, you make the decision to create a secondary backup into the cloud as a test. After a little setup and tinkering you finish up and go on with your daily tasks.

A few months later your onsite storage device hard fails and there is massive data loss. A new system is delivered the same day, and once the setup is complete the tapes are delivered and the restore process starts. Three hours into the restore a bad tape is encountered and again you are faced with massive data loss. The entire group is now in panic mode. It suddenly hits you that you set up a test backup offsite. What are the odds that it is still functioning and you will be able to get the data? With help from the entire department you get the network right and the data transfer starts. About one hour later the data is restored and your employees are happy, not to mention your boss. You are now an IT hero.


FAILURE

A few months later your onsite storage device hard fails and there is massive data loss. A new system is delivered the same day, and once the setup is complete the tapes are delivered and the restore process starts. Three hours into the restore a bad tape is encountered and again you are faced with massive data loss. The entire group is now in panic mode. After many attempts to repair the damaged tape, and after having multiple experts look at the failed storage device, you and your team realize that 5 days of data will be lost and will have to be recreated. Not a great day for your team. You are now an IT failure.

Moral of the story?

Use the tools the world provides to stay ahead of the curve. All it takes is one mistake to be a failure.

December 7, 2009

Availability with NetScaler VPX and Global Load Balancing

The concept of a Single Point of Failure refers to the fact that somewhere between your clients and your servers there is a single point that, if it fails, causes downtime. The SPoF can be the server, the network, or the power grid. The dragon Single Point of Failure is always going to be there stalking you; the idea is to push SPoF far enough out that you have done the best you can with your ability and budget.

At the server level you could combat SPoF by using redundant power supplies and disks. You can also have redundant servers fronted by a load balancer. One of the benefits of load balancer technology is that the traffic for an application is spread between multiple app servers. You have the ability to take an app server out of rotation for upgrades and maintenance. When you’re done you bring the server back online, the load balancer notices it is UP on the next check, and the server is back in service.
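To make the rotation idea concrete, here is a minimal sketch in Python. It is a toy model and not how a NetScaler works internally: the class name `RoundRobinBalancer`, the server names, and the `health_check` callable are all assumptions made up for illustration.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates over servers that passed the last health check."""

    def __init__(self, servers, health_check):
        self.servers = list(servers)
        self.health_check = health_check  # hypothetical callable: server name -> bool
        self.up = set(self.servers)       # servers currently in rotation
        self._next = 0                    # rotation pointer

    def run_health_checks(self):
        # Re-evaluate every server; one that comes back UP rejoins the rotation.
        self.up = {s for s in self.servers if self.health_check(s)}

    def pick(self):
        # Walk the list once from the rotation pointer, skipping servers marked DOWN.
        for offset in range(len(self.servers)):
            idx = (self._next + offset) % len(self.servers)
            candidate = self.servers[idx]
            if candidate in self.up:
                self._next = (idx + 1) % len(self.servers)
                return candidate
        raise RuntimeError("no healthy servers in rotation")


# Usage: take app2 out for maintenance, then bring it back.
status = {"app1": True, "app2": True, "app3": True}
lb = RoundRobinBalancer(status, lambda name: status[name])

status["app2"] = False
lb.run_health_checks()
during_maintenance = [lb.pick() for _ in range(4)]  # app1, app3, app1, app3

status["app2"] = True
lb.run_health_checks()
after_maintenance = [lb.pick() for _ in range(3)]   # app1, app2, app3
```

Traffic simply flows around the server that is down, and nothing needs to be reconfigured when it returns; the next health check puts it back in service.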

Using a NetScaler VPX you can even have two groups of servers: one group that generally answers your queries and another that usually does something else. Through the Backup Virtual Server function, the second group acts as a backup in case all of the primary servers for a service have to be taken down.

Result: no Single Point of Failure for the app servers.

What happens if you are load balancing and have to take the load balancer out of service for upgrades or maintenance? Right, now we’ve moved SPoF up a level. One way to handle this is by using the NetScaler VPX product we have at SoftLayer. A pair of VPX instances (NodeA/NodeB) can be teamed in a failover cluster so that if the primary VPX is taken down (either by human action or because the hardware failed) the secondary VPX will begin answering for the IPs within a few seconds and processing the actions. When you bring NodeA back online it slips into the role of secondary until such time as NodeB fails or is taken down. I will note here that VPX instances do have a dependency on certain network resources, and a failure of those resources can take both VPX instances down.
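The role swap described above can be sketched as a tiny state machine. This is only an illustration of the behavior, assuming a hypothetical `FailoverPair` class; it says nothing about how the VPX pair actually detects failures or moves the IPs.

```python
class FailoverPair:
    """Toy active/passive pair: one node answers for the IPs, the other waits."""

    def __init__(self, node_a="NodeA", node_b="NodeB"):
        self.primary = node_a
        self.secondary = node_b
        self.healthy = {node_a: True, node_b: True}

    def mark_down(self, node):
        self.healthy[node] = False
        # If the primary went down and the secondary is healthy, swap roles.
        if node == self.primary and self.healthy[self.secondary]:
            self.primary, self.secondary = self.secondary, self.primary

    def mark_up(self, node):
        self.healthy[node] = True
        # A returning node stays secondary unless the current primary is down.
        if node == self.secondary and not self.healthy[self.primary]:
            self.primary, self.secondary = self.secondary, self.primary

    def active(self):
        # The node answering right now, or None if both nodes are down.
        return self.primary if self.healthy[self.primary] else None


pair = FailoverPair()
pair.mark_down("NodeA")        # NodeB takes over
pair.mark_up("NodeA")          # NodeA returns as secondary
answering = pair.active()      # still NodeB
```

Note the last case in `active()`: if something both nodes depend on fails, the pair as a whole has nothing left to fail over to.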

Result: Loss of a single VPX is not a Single Point of Failure.

So what’s next? A wide-ranging power failure or general network failure of either the frontend or the backend network could render both of the NetScalers in a city unusable or even the entire facility unusable. This can be worked around by having resources in two cities which are able to process queries for your users and by using the Global Load Balancer product we offer. GLB load balances between the cities using DNS results. A power failure taking down Seattle just means your queries go to Dallas instead. Why not skip the VPX layer and just GLB to the app servers? You could, if you don’t have a need for the other functionalities from the VPX.
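As a rough picture of what "load balances between the cities using DNS results" means, the sketch below filters the answers for a hostname by site health. The function name, the hostname, and the data layout are assumptions for illustration, and the IP addresses come from the ranges reserved for documentation; a real global load balancer has more policies than this.

```python
def resolve(hostname, sites, site_is_up):
    """Return the IPs a DNS-based global load balancer might hand out:
    only addresses in cities that are currently reachable."""
    return [site["ip"] for site in sites[hostname] if site_is_up[site["city"]]]


# Hypothetical records: one VPX-fronted address per city.
sites = {
    "www.example.com": [
        {"city": "Seattle", "ip": "192.0.2.10"},
        {"city": "Dallas", "ip": "198.51.100.10"},
    ]
}

normal = resolve("www.example.com", sites, {"Seattle": True, "Dallas": True})
seattle_dark = resolve("www.example.com", sites, {"Seattle": False, "Dallas": True})
```

With Seattle marked down, only the Dallas address is returned, so new lookups send users to Dallas. Clients that cached the Seattle answer keep trying it until the record's TTL expires, which is why short TTLs matter with this approach.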

Result: no Single Point of Failure at the datacenter level.

Having redundant functionality between cities takes planning, it takes work, and it takes funding. You have to consider synchronization of content. The web content is easy. Run something like an rsync from time to time. Synching the database content between machines or across cities is a bit more complicated. I’ve seen some customers use the built-in replication capabilities of their database software while others will do a home-grown process such as having their application servers write to multiple database servers. You also have to consider issues of state for your application. Can your application handle bouncing between cities?
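For the home-grown approach, a minimal sketch of an application writing to multiple database servers might look like the following. It uses in-memory SQLite databases as stand-ins for the real servers, and the `orders` table and function names are invented for the example.

```python
import sqlite3


def open_replica():
    # Stand-in for a database server in one city (assumption: in-memory SQLite).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    return conn


def write_to_all(replicas, order_id, item):
    """Apply the same insert to every replica; return the names that failed."""
    failed = []
    for name, conn in replicas.items():
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item)
                )
        except sqlite3.Error:
            failed.append(name)
    return failed


replicas = {"seattle": open_replica(), "dallas": open_replica()}
failed = write_to_all(replicas, 1, "widget")
```

The returned list is the important part. If one write succeeds and another fails, the copies have drifted apart, and the application needs some way to notice and repair that. That bookkeeping is much of what built-in replication does for you.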

Redundancy planning is not always fun but it is required for serious businesses, even if the answer is ultimately to not do any redundancy. People, hardware and processes will fail. Whether a failure event is a nightmare or just an annoyance depends on your preparation.

October 19, 2009

I have backups…Don’t I?

There is some confusion out there on what’s a good way to back up your data. In this article we will go over several good ways to back up and store your backups, along with a few ways that are not recommended.

When it comes to backups, storing them off system (off your server, or on a secondary drive not running your system) is the best solution, with storing them off site being the recommended course.

When RAID comes into consideration, redundant drives (a mirrored array, for example) are not a substitute for backups. Several situations can cause a complete RAID failure: the RAID controller failing, the array developing a bad stripe, more than one drive failing (this does happen, though rarely), or out-of-date firmware on the drives or the RAID card causing errors. Using a network storage device like our eVault or NAS storage is also an excellent way to store backups off system.

The last thing to consider is keeping your backups up to date. I suggest making a new backup every week at minimum (if you have very active sites or databases, I would recommend backing up daily or every other day). It is up to you or your server administrator to keep up with your backups and make sure they are kept up to date. If you have a hardware failure and your backups are well out of date, it’s almost like not having them at all.
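One way to keep yourself honest about backup age is a small check you can run from a scheduled job. This is only a sketch: the function name is made up, and where the timestamp of the last backup comes from (a file's modification time, a log entry, your backup tool) is left as an assumption.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_backup, now, max_age=timedelta(days=7)):
    """True if the newest backup is older than the allowed age.

    The default of 7 days matches the weekly minimum suggested above;
    pass timedelta(days=1) for very active sites or databases.
    """
    return now - last_backup > max_age


now = datetime(2009, 10, 19, 12, 0)
weekly_ok = backup_is_stale(datetime(2009, 10, 15, 2, 0), now)                      # False
daily_late = backup_is_stale(datetime(2009, 10, 15, 2, 0), now, timedelta(days=1))  # True
```

The same four-day-old backup passes a weekly policy and fails a daily one, so pick the threshold to match how much data you can afford to recreate.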

In closing, consider the service you provide and whether your data is safe, secure, and recoverable. These things are key to running a successful server and website.

June 18, 2008

Planning for Data Center Disasters Doesn’t Have to Cost a Lot of $$

One of the hot topics over the past couple of weeks in our growing industry has been how to minimize downtime should your (or your host’s) data center experience catastrophic failure leading to outages that could span multiple days.

Some will think that it is the host’s responsibility to essentially maintain a spare data center into which they can migrate customers in case of catastrophe. The reason we don’t do this is simple economics. To maintain this type of redundancy, we’d need to charge you at least double our current rates. Because costs begin jumping exponentially instead of linearly as extensive redundancy is added, we’d likely need to charge you more than double our current rates. You know what? Nobody would buy at that point. It would be above the “reservation price” of the market. Go check your old Econ 101 notes for more details.

Given this economic reality, we at SoftLayer provide the infrastructure and tools for you to recover quickly from a catastrophe with minimal cost and downtime. But, every customer must determine which tools to use and build a plan that suits the needs of the business.

One way to do this is to maintain a hot-synched copy of your server at a second of our three geographically diverse locations. Should catastrophe happen to the location of your server, you will stay up and have no downtime. Many of you do this already, even keeping servers at multiple hosts. According to our customer surveys, 61% of our customers use multiple providers for exactly that reason – to minimize business risk.

Now I know what you’re thinking – “why should I maintain double redundancy and double my costs if you won’t do it?” Believe me, I understand this - I realize that your profit margins may not be able to handle a doubling of your costs. That is why SoftLayer provides the infrastructure and tools to provide an affordable alternative to running double infrastructure in multiple locations in case of catastrophe.

SoftLayer’s eVault offering can be a great cost effective alternative to the cost of placing servers in multiple locations. Justin Scott has already blogged about the rich backup features of eVault and how his backup data is in Seattle while his server is in Dallas, so I won’t continue to restate what he has already said. I will add that eVault is available in each of our data centers, so no matter where your server is at SoftLayer, you can work with your sales rep to have your eVault backups in a different location. Thus, for prices that are WAY lower than an extra server (eVault starts at $20/month), you can keep near real-time backups of your server data off site. And because the data transfer between locations happens on SoftLayer’s private network, your data is secure and the transfer doesn’t count toward your bandwidth allotment.

So let’s say your server is in our new Washington DC data center and your eVault backups are kept in one of our Dallas data centers. A terrorist group decides to bomb data centers in the Washington DC area in an attempt to cripple US government infrastructure and our facility is affected and won’t be back up for several days. At this point, you can order a server in Dallas, and once it is provisioned in an hour or so, you restore the eVault backup of your choice, wait on DNS to propagate based on TTL, and you’re rolling again.
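To get a feel for how long "rolling again" takes, the steps above can be added up. The sketch below is back-of-the-envelope arithmetic, not a SoftLayer figure: only the roughly one hour of provisioning comes from the text, while the restore time and DNS TTL are placeholder assumptions you would replace with your own numbers.

```python
from datetime import timedelta


def estimate_downtime(provision, restore, dns_ttl, overlap_dns=False):
    """Rough downtime estimate for the recover-from-backup plan.

    Sequential: provision, then restore, then wait out the DNS TTL.
    Overlapped: DNS is repointed as soon as the new server has an IP,
    so the TTL wait runs in parallel with the restore.
    """
    if overlap_dns:
        return provision + max(restore, dns_ttl)
    return provision + restore + dns_ttl


provision = timedelta(hours=1)      # "an hour or so" from the text
restore = timedelta(minutes=90)     # assumption: depends on data size
dns_ttl = timedelta(seconds=3600)   # assumption: a one-hour TTL

sequential = estimate_downtime(provision, restore, dns_ttl)        # 3:30:00
overlapped = estimate_downtime(provision, restore, dns_ttl, True)  # 2:30:00
```

Either way, the TTL on your DNS records is the one part of this plan you can shorten ahead of time, before anything goes wrong.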

Granted, you do experience some downtime with this recovery strategy. But the tradeoff is that you are up and running smoothly after the brief downtime at a cost for this contingency that begins at only $20 per month. And when you factor in your SLA credit on the destroyed server, this offsets the cost of ordering a new server, so the cost of your eVault is the only cost of this recovery plan.

This is much less than doubling your costs with offsite servers to almost guarantee no downtime. The reason that I throw in the word “almost” is that if an asteroid storm takes out all of our locations and your other providers’ locations, you will experience downtime. Significant downtime.


February 4, 2008

Nnet Strikes Back

I'm not going to tell you my name for two reasons: First, I don't want a million tickets assigned to me asking if I'm crazy. Second, if I am crazy, I don't want anyone knowing it's me.

I'm not a writer myself, so I asked Shawn to write this up for me. He's a programmer, and more important a Trekkie, so he's likely to understand (and more important, believe) this story. Besides, he's written a few humorous, slightly preposterous posts for this blog, and that's very, very important.

Unlucky as I am, I was the first person to notice something strange going on. I'm a datacenter tech for the company (but I'm not going to tell you WHICH datacenter), and my job... well, I'm the power guy. I make rounds in the datacenter, checking breakers and power panels, keep an eye on voltages in the portal, that kind of thing. No power issues at the datacenter? That's because of me. So, I'm perusing the tickets and keeping an eye on things, like I should.

As I was answering a particularly interesting ticket, I received an IM from a datacenter engineer I hadn't met yet. That's not surprising; we're growing like crazy here, and I don't always get the "Welcome a new employee" email before I find myself working with the guy or gal. I finished my ticket and opened up the IM window. It was from "Nnet," and the contents caused me to leap out of my seat:

"The power strips on the new racks (205, 206, 207) are drawing too much current; it will pop the breakers in 52 minutes, 12 seconds."

I had just CHECKED those racks. I walked down to the server room, muttering about some whippersnapper of a new engineer playing a trick on me. I was going on vacation in a week, and I did NOT want any power issues; I was training another engineer to take the console while I was gone, and if anything happened during testing I would surely be called in. Anyway, I walked into the server room and checked the gauges on the power panel.

And they were drawing almost a full five amps too much. If we had turned on the third rack, the whole aisle would have gone down. That wouldn't have been too bad; no servers were hooked up. This is exactly why we test the power before we put servers in.

The rack crew and I worked for about an hour rewiring the racks, starting from the third rack. Sure enough, about 52 minutes later, rack 205 shut down. Mentally thanking "Nnet" for finding this (and more importantly, not tinkering with it before letting me know!), we got the racks wired more efficiently (they're supposed to be on separate breakers, but the electrician labeled the wires wrong), reset the breakers, and had absolutely no issues for the rest of the day.

I got back and thanked "Nnet" for finding that issue. The next day, I got to thinking about how "Nnet" had saved my vacation (I would have spent all week tracing wires to figure out what had happened), and I wanted to invite him or her to lunch. So I IMmed "Nnet" with an invitation. An hour went by with no response, but it's not too strange to have a datacenter tech away from their desk for a couple hours. So I sent an email to Nnet.

The email bounced back.

-"Mystery Author"

Maybe HR hadn't set up the email yet? So I called them up to see what was up with Nnet's email address.

That's when HR told me that nobody with the last name "Net" had been hired (I thought "Net" was a strange name for a tech, but it's not the strangest last name I've ever heard). I called the networking department to ask how I could receive a company IM from somebody who doesn't work here. They researched it and couldn't find any incoming links through our firewalls or in any of the internal logs. Stranger yet, the Jabber server indeed DOES have an account for "Nnet", but the engineer who runs the server swears that he never set that up.

We were discussing this back and forth when one of the developers walked by, overhearing our conversation. He laughed, and when we asked why, he told us that he was reading a book about the human brain, and that the brain is made up of millions upon millions of neurons all interconnected with each other; that these interconnected neurons work together to create intelligence.

Could that be true? Absolutely not. It's preposterous. Sure, we've got tens of thousands of computers around here, dual cores and quad cores running various operating systems and applications, all connected by an incredibly fast private network...

...could it be?

The engineers are all completely sure that one of the datacenter techs must be playing a joke, and they're currently tracking it down. But I'm not too convinced. "Nnet" knew which power strips were having trouble in a room keycarded to open only for me and a handful of other techs. And they all swear they didn't send it.

That's when I talked to Shawn. He told me that there are a lot of technically minded people out there who read fantastic science fiction stories and come up with solutions... even knowing that the tech is impossible, they can find a way to solve the problem. So we hatched up this idea to write out a fantastic blog post, an interesting narrative of my predicament. Then we'd post it to the blog and watch for any discussion on the customer forums. Our customers are really smart, and they like solving problems. Maybe somebody out there has an idea of how we can figure out what's going on around here.

So here's the story. A completely fantastic modern day science fiction story about a sentient datacenter.


...any ideas?
