Posts Tagged 'Backups'

October 28, 2009

Meet Virus Jack

I am Jack’s Vundo Virus. I cause Jack’s computer to have popups. I also disable Windows Automatic Updates, Task Manager, Registry Editor, and msconfig so Jack cannot boot to safe mode. I use Jack’s Norton AntiVirus to help spread my infection. I make Jack’s Google searches redirect to rogue antispyware sites. Jack got me by not keeping his system up to date. Now there are programs out there designed to remove me, but the best way is for Jack to reformat. Let’s hope he has backups. The moral of this story is to keep your computers up to date with the latest OS updates, AntiVirus definitions and program updates.

October 16, 2009

Raid 1 or Raid 0: which should I choose?

When considering these 2 Raid options, there are a few points you’ll want to weigh before making your final choice.

The first to consider is your data, so ask yourself these questions:

  • Is it critical that your data be recoverable?
  • Do you have backups of your data that can be restored if something happens?
  • Do you want some kind of redundancy and the ability to have a failed drive replaced without your data being destroyed?

If you have answered yes to most of these, you are going to want to look at a Raid 1 configuration. With a Raid 1 you have 2 drives of like size matched together in an array, which consists of an active drive and a mirror drive. Either of these drives can be replaced should one go bad without any loss of data and without taking the server offline. Of course, this assumes that the Raid card that you are using is up to date on its firmware and supports hot swapping.

If you answered no to most of these questions other than the backup question (you should always have backups), a Raid 0 set-up is probably sufficient. This is used mostly for disk access speeds and does not contain any form of redundancy or failover. If you have a drive failure while using a Raid 0, your data will be lost 99% of the time. This is an unsafe Raid method and should only be used when the data contained on the array is not critical in any way. Unfortunately, with this solution there is no other course of action that can be taken other than replacing the drives and rebuilding a fresh array.
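
To make the difference concrete, here is a toy model in Python. It is only a sketch: each "drive" is a plain dictionary, and the function names are made up for this example rather than taken from any Raid controller. It shows why one failed drive ends a Raid 0 array but not a Raid 1 array.

```python
# Toy model of the two layouts: each "drive" is a dict of block number -> data.
# This is a simplified illustration, not how a real Raid controller works.

def write_raid0(blocks, n_drives=2):
    """Stripe: block i lives on drive i % n_drives only."""
    drives = [{} for _ in range(n_drives)]
    for i, data in enumerate(blocks):
        drives[i % n_drives][i] = data
    return drives

def write_raid1(blocks, n_drives=2):
    """Mirror: every drive holds a full copy of every block."""
    return [dict(enumerate(blocks)) for _ in range(n_drives)]

def read_all(drives, n_blocks):
    """Return the data if every block is on a surviving drive, else None."""
    result = []
    for i in range(n_blocks):
        copies = [d[i] for d in drives if d is not None and i in d]
        if not copies:
            return None  # at least one block is gone, so the array is lost
        result.append(copies[0])
    return result

blocks = ["a", "b", "c", "d"]

raid0 = write_raid0(blocks)
raid1 = write_raid1(blocks)

# Simulate losing the first drive in each array.
raid0[0] = None
raid1[0] = None

print("Raid 0 after one failure:", read_all(raid0, len(blocks)))
print("Raid 1 after one failure:", read_all(raid1, len(blocks)))
```

In the striped array, half of the blocks lived only on the failed drive, so nothing complete can be read back. In the mirrored array, the surviving drive still holds everything.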

I hope this helps to clear up some of the confusion regarding these 2 Raid options. There are several other levels of Raid which I would suggest fully researching before you consider using one of them.

May 15, 2009

Disaster Recovery Plan

A few days ago I was reading a news story about a man who had just lost everything to a fire. One of the comments he made was that he had never thought to plan for something like this; it was the type of thing that happened to other people but never to him. I started thinking about how true that statement was. Many people just never think it will happen to them.

This type of situation happens every day in the IT field. Some sort of disaster causes a server to crash or simply stop working altogether; the drives on the server are completely corrupted and the data is just gone. The question is: when this happens to you, will you be prepared? Thankfully, there are steps each person can take to limit the pain and downtime a situation like this can cause. Like any other disaster recovery plan, the more you are willing to put into it, the more protection you will have when disaster strikes.

This is where SoftLayer comes in. Here at SoftLayer we understand the importance of providing our customers the means to create a good disaster recovery plan that meets their needs. We understand that a detailed disaster recovery plan will include things such as backups and replication. Our services such as NAS and EVault are perfect solutions for performing and managing the backups for your server. When looking into replication, we offer services such as iSCSI replication, Raids, and local and global load balancing, which provide our customers with the tools to replicate not only their data across multiple locations but their servers as well. Above all, we provide our private network to securely transfer this data to the many locations without impacting the traffic on your public network.

We can only hope that on the day disaster strikes, everyone has some plan in place to deal with it. There is nothing more frustrating in this industry than the loss of crucial data that in many instances cannot be recovered.

June 20, 2008

I Always Have a Backup Plan

It was the day of the big secret meeting. All my vice presidents were there except for the unix system administrator. He was a strange man, always wearing that robe, with the long beard and long hair. He considered himself some sort of wizard, and after the conflict last month when we decided to switch all our servers over to SoftLayer, I really didn’t want him involved in the meeting I called today. You see, I called it so I could announce my plan to switch our servers over to Windows. My goal was really to get rid of him; he’s the only one who ever managed to thwart my plans.

Just as I finished that thought, he burst through the door, trailing a long ribbon of old-fashioned printer paper behind him. “How dare you have a systems meeting without me!” he intoned, dropping his stack of papers on the conference table in front of me. A quick glance at the stack told me that he had printed out operating statistics for every version of Unix and every version of Windows going back to 1985. I didn’t have time for this. Luckily, I always have a backup plan.

Turning away slightly, I quickly activated a program on my Blackberry. You see, yesterday I had written a few custom programs that utilize the SoftLayer API to control a variety of our services. Within moments, a confirmation had appeared on my screen. All of our web traffic had been redirected from our load balanced main servers to our tertiary backup server. In the middle of the work day, that meant it was only a matter of minutes before our bandwidth would be exceeded on that server. I allowed the sysadmin to begin his presentation, confident that he would barely get past the 8086 before disaster struck.

I was right! Within minutes, an email arrived notifying us that we were nearing the bandwidth cap on the hostname last_resort. Panicked, the sysadmin left the meeting. Quickly I summarized my plans to the other VPs, we all voted unanimously for Windows, and I retreated to my office. Shortly after sitting behind my desk, my door burst open. Framed in the light from the hallway, his long shadow washing over me, stood the sysadmin, slowly twirling his staff. “Do you think you can stop me with a simple change to our load balancer? I was configuring load balancers when you were still on dial-up! Now, you will listen, AOL user, and you will see why Unix is your only choice!” Of course, I had a backup plan for just such a situation.

I dove out the window next to my desk, landing nimbly next to my secretary’s bright pink LeBaron. I had made copies of all her keys months ago in order to utilize her unique vehicle for any necessary escapes. I quickly tapped out a text message to Michael in SoftLayer sales. We have a standing agreement that when he receives a message from me containing only the word DAWT, he is to send the best sale at his disposal to my sysadmin. As I drove past the front door of the building I saw him running toward the car. He pulled out his Blackberry in mid-stride and suddenly stopped dead. “Free double RAM AND double hard drives!? IMPOSSIBLE!” he screamed, and I managed to swerve around him and escape. As I drove away, I thought about my secretary. When she first started here, I had convinced her that if her car were ever stolen, the best plan of action would be to change the building security policies so that only my badge could open the doors. I hoped I didn’t need to make use of that plan, but the sysadmin has proved a worthy adversary.

Unbelievable! Even with my masterful backup plan, he was still following me. I saw his battered VW Bus merge into traffic behind me, his vulture-like shadow looming behind the wheel. I sped up until we were both racing down the road, weaving in and out of the other vehicles. Finally we passed a police car, and my next plan sprang into action. I knew that standard procedure was to radio in the vehicles you were pursuing, and I knew my friend Joe was on duty today. Joe knew that if he ever received a radio call about a businessman in a pink LeBaron being chased down the highway by a wizard in a VW Bus, he was to call off the police and park a fire truck at a certain intersection. You see, I had hired an actor to pretend to be a corporate psychiatrist, and learned that the sysadmin had an irrational fear of fire trucks. Why? Because it always pays to have a backup plan.

I angled toward the intersection and managed to squeeze past the truck just as it pulled up to block the street. I heard the squeal of tires as the sysadmin slammed on his brakes and reversed wildly behind me. Now that I was free, however, I couldn’t return to the office. Luckily I was prepared for just such an eventuality. As I drove to my next location, I quickly used my Blackberry to shut down one of our production web servers. I knew that it would be 20 minutes before the monitoring system would officially declare the server “down,” so I had time.

I made it to my secret office above the video arcade not long after. Before leaving the car I collected the grappling hook and rope from a secret compartment in the door, then went inside. I walked in to the darkened room and immediately noticed something was wrong. My security system wasn’t beeping! The door slammed behind me and the sysadmin boomed out “NO PLAN CAN DEFEAT ME, MORTAL!”

“I’m ALWAYS prepared!” I shot back, and quickly glanced at my watch. It had been 19 minutes and 45 seconds since I shut down my server; the timing was perfect! The sysadmin walked toward me, twirling that staff. Just as he was about to reach me, his Blackberry beeped. Pausing to check, he let out a stream of curses and then lunged at me, but I had already rappelled down the side of the building and made my escape.

As soon as I reached the car, my Blackberry alerted me that the server I shut down was back up. How!? The sysadmin must have his own API programs! I cringed as I activated my final backup plan: a program that constantly shut down all our servers. Let’s see him handle that! I took the direct route back to the office, past the still-idling fire truck. I threw Joe a wave, knowing that I’d owe him a big favor for this, and rocketed back to the office. I knew that he would be right behind me, but hopefully with all our servers offline he wouldn’t beat me to my destination. Also, once I made it into the building, the security system wouldn’t allow anyone in behind me. I would be safe!

I raced into the building, looking frantically around for the sysadmin, but he was nowhere to be seen. Finally! I had defeated him! I walked calmly to my office and opened the door, only to see HIM, climbing in through my window. I had forgotten to close it when I escaped this morning! I quickly opened the secret panel in the wall next to the door and put my finger on the red button.

“WAIT!” cried the sysadmin. “We need to put our differences behind us. Our plans have almost destroyed our servers!”

“What do you mean?” I demanded. “They’re fine!”

“No, they’re not,” he said in a sad voice. “You see, I always have a backup plan, and I knew that eventually someone would attempt to power off our machines, so I wrote a script to constantly turn the machines on!”

“B-but…” I stammered, “but I wrote a script to constantly turn them OFF!”

“I know,” he said, “and the constant power cycling has corrupted our database. We need to set aside this silly feud and fix it.”

“Don’t worry, dear end user,” I proudly proclaimed, “I always have a backup-”

It was right then I realized that in all my planning, I had never actually created any backups.

-Daniel

June 18, 2008

Planning for Data Center Disasters Doesn’t Have to Cost a Lot of $$

One of the hot topics over the past couple of weeks in our growing industry has been how to minimize downtime should your (or your host’s) data center experience catastrophic failure leading to outages that could span multiple days.

Some will think that it is the host’s responsibility to essentially maintain a spare data center into which they can migrate customers in case of catastrophe. The reason we don’t do this is simple economics. To maintain this type of redundancy, we’d need to charge you at least double our current rates. Because costs begin jumping exponentially instead of linearly as extensive redundancy is added, we’d likely need to charge you more than double our current rates. You know what? Nobody would buy at that point. It would be above the “reservation price” of the market. Go check your old Econ 101 notes for more details.

Given this economic reality, we at SoftLayer provide the infrastructure and tools for you to recover quickly from a catastrophe with minimal cost and downtime. But, every customer must determine which tools to use and build a plan that suits the needs of the business.

One way to do this is to maintain a hot-synched copy of your server at a second of our three geographically diverse locations. Should catastrophe happen to the location of your server, you will stay up and have no downtime. Many of you do this already, even keeping servers at multiple hosts. According to our customer surveys, 61% of our customers use multiple providers for exactly that reason – to minimize business risk.

Now I know what you’re thinking – “why should I maintain double redundancy and double my costs if you won’t do it?” Believe me, I understand this - I realize that your profit margins may not be able to handle a doubling of your costs. That is why SoftLayer provides the infrastructure and tools to provide an affordable alternative to running double infrastructure in multiple locations in case of catastrophe.

SoftLayer’s eVault offering can be a great cost-effective alternative to placing servers in multiple locations. Justin Scott has already blogged about the rich backup features of eVault and how his backup data is in Seattle while his server is in Dallas, so I won’t restate what he has already said. I will add that eVault is available in each of our data centers, so no matter where your server is at SoftLayer, you can work with your sales rep to have your eVault backups in a different location. Thus, for prices that are WAY lower than an extra server (eVault starts at $20/month), you can keep near real-time backups of your server data off site. And because the data transfer between locations happens on SoftLayer’s private network, your data is secure and the transfer doesn’t count toward your bandwidth allotment.

So let’s say your server is in our new Washington DC data center and your eVault backups are kept in one of our Dallas data centers. A terrorist group decides to bomb data centers in the Washington DC area in an attempt to cripple US government infrastructure and our facility is affected and won’t be back up for several days. At this point, you can order a server in Dallas, and once it is provisioned in an hour or so, you restore the eVault backup of your choice, wait on DNS to propagate based on TTL, and you’re rolling again.
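
For a rough feel of how long "rolling again" takes, the three steps above can be added up. The Python below is only a back-of-the-envelope sketch: the function name, the restore speed, the backup size and the TTL are all assumptions picked for illustration, not SoftLayer figures. Only the "hour or so" of provisioning comes from the scenario above.

```python
def estimated_downtime_minutes(provision_min, backup_gb, restore_gb_per_min, dns_ttl_min):
    """Rough outage estimate: provision, then restore, then wait out the DNS TTL."""
    restore_min = backup_gb / restore_gb_per_min
    return provision_min + restore_min + dns_ttl_min

# Hypothetical inputs: 60 min provisioning ("an hour or so"), a 20 GB restore
# at an assumed 1 GB/min, and a 60 min TTL on the DNS records.
total = estimated_downtime_minutes(60, 20, 1.0, 60)
print(f"Estimated downtime: {total:.0f} minutes")
```

The last term is the one you control ahead of time: a shorter TTL on your DNS records means less waiting once the new server is ready.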

Granted, you do experience some downtime with this recovery strategy. But the tradeoff is that you are up and running smoothly after the brief downtime at a cost for this contingency that begins at only $20 per month. And when you factor in your SLA credit on the destroyed server, this offsets the cost of ordering a new server, so the cost of your eVault is the only cost of this recovery plan.
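
The arithmetic behind that tradeoff is simple enough to sketch in Python. The $20 per month is the eVault starting price mentioned above; the server price is a made-up number for illustration only, so plug in your own.

```python
EVAULT_MONTHLY = 20        # starting price quoted in the post
SERVER_MONTHLY = 200       # hypothetical server price, for illustration only

def yearly_cost_hot_standby(server_monthly):
    """A second, always-on server in another location."""
    return 12 * server_monthly

def yearly_cost_backup_only(evault_monthly):
    """Off-site backups only; a replacement server is ordered after a disaster."""
    return 12 * evault_monthly

standby = yearly_cost_hot_standby(SERVER_MONTHLY)
backup = yearly_cost_backup_only(EVAULT_MONTHLY)
print(f"Hot standby: ${standby}/yr, backup only: ${backup}/yr")
```

With these assumed numbers the backup-only plan costs a tenth of the hot standby, in exchange for the brief downtime described above.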

This is much less than doubling your costs with offsite servers to almost guarantee no downtime. The reason that I throw in the word “almost” is that if an asteroid storm takes out all of our locations and your other providers’ locations, you will experience downtime. Significant downtime.

-Gary

June 14, 2008

In Memory of Dawn

Dawn was the best friend I’ve ever had, except for my little sister. Just yesterday I got home only to find out that Dawn had died silently in the night. No amount of resuscitation could bring her back. Needless to say, I was quite sad.

Dawn was my computer.*

The funny part of it all was just how much of my time involves a computer. I watch TV and Movies on my computer, I play games on my computer, I do my banking on my computer, I pay all my bills on my computer, I schedule my non-computer time on my computer, I use my computer as a jukebox.

In other words, I was completely lost. What made it worse, however, was that I had had yesterday scheduled to pay my bills. But where was my list of bills?

If you guessed “Dawn had all your bills”, then you are right.

What about paper bills? I’ve got the Internet and a computer! So, in most cases I’ve canceled paper bills. All paper bills I get are shredded forthwith. So I had no paper backup of bills.

Well, I made do. I kicked my roommate off his computer (a technique involving making annoying noises while he tries to concentrate playing Call of Duty 4) and used it to pay what bills I could remember. I kept track of the bills I was paying by entering them into a Google Document.

That’s when it hit me! Why wasn’t my bill spreadsheet on Google Documents? Along with my bill list? Along with all the other documents I work on every day? Cloud Computing For The Win! As soon as I get my next computer up and running (and I figure out a new naming algorithm) I’m going to put all my vital files on Google Docs. This ties in well with Justin Scott’s post; the key to not having your data disappear during a disaster is to have a backup copy. You want backups out there, far away from your potential point of failure. (I did have backups… but they’re all on CDs that I didn’t want to have to sort through to find just one file. And had the disaster been, say, a flood, I would have had no backups.)

Google Docs is a great example of Cloud Computing: Putting both the program and the file being worked on “in the cloud.” Having built internal applications for a few people, I would make the same recommendation: Since many business apps are moving to PHP anyway (thanks for the reminder, Daniel!), you might as well move the application AND the data out of the building and onto a secure server. And as Mr. Scott** mentioned, SoftLayer ALREADY has geographic diversity as well as a private network that will allow you to link your application and data servers together in real time through all datacenters… for free. Along with the added bonus of being able to access your application from any computer… should yours meet up with Misty, May, and Dawn at the Great Datacenter in the Sky.

-Zoey

* I had a system of naming my computers after the female protagonists from the Pokemon series. Dawn, however, is the last of that series…

** I’ve decided that since Justin is an Engineer, calling him Mr. Scott is funny.

June 4, 2008

Wait … Back up. I Missed Something!

I’ve been around computers all my life (OK, since 1977 but that’s almost all my life) and was lucky to get my first computer in 1983.

Over the summer of 1984, I was deeply embroiled in (up to that point) the largest programming project of my life, coding Z80 ASM on my trusty CP/M computer when I encountered the most dreaded of all BDOS errors, “BDOS ERROR ON B: BAD SECTOR”

In its most mild form, this cryptic message simply means “copy this data to another disk before this one fails.” However, in this specific instance, it represented the most severe case… “this disk is toast, kaputt, finito, your data is GONE!!!”

Via the School of Hard Knocks, I learned the value of keeping proper backups that day.

If you’ve been in this game for longer than about 10 milliseconds, it’s probable that you’ve experienced data loss in one form or another. Over the years, I’ve seen just about every kind of data loss imaginable, from the 1980s accountant who tacked her data floppy to the filing cabinet with a magnet so she wouldn’t misplace it, all the way to enterprise/mainframe class SAN equipment that pulverizes terabytes of critical data in less than a heartbeat due to operator error on the part of a contractor.

I’ve consulted with thousands of individuals and companies about their backup implementations and strategies, and am no longer surprised by administrators who believe they have a foolproof backup utilizing a secondary hard disk in their systems. I have witnessed disk controller failures which corrupt the contents of all attached disk drives, operator error and/or forgetfulness that leave gaping holes in so-called backup strategies and other random disasters. On the other side of the coin, I have personally experienced tragic media failure from “traditional backups” utilizing removable media such as tapes and/or CD/DVD/etc.

Your data is your life. I’ve waited up until this point to mention this, because it should be painfully obvious to every administrator, but in my experience the mentality is along the lines of “My data exists, therefore it is safe.” What happens when your data ceases to exist, and you become aware of the flaws in your backup plan? I’ll tell you – you go bankrupt, you go out of business, you get sued, you lose your job, you go homeless, and so-on. Sure, maybe those things won’t happen to you, but is your livelihood worth the gamble?

“But Justin… my data is safe because it’s stored on a RAID mirror!” I disagree. Your data is AVAILABLE, your data is FAULT TOLERANT, but it is not SAFE. RAID controllers fail. Disaster happens. Disgruntled or improperly trained personnel type ‘rm -rf /’ or accidentally select the wrong physical device when working with the Disk Manager in Windows. Mistakes happen. The unforeseeable, unavoidable, unthinkable happens.

Safe data is geographically diverse data. Safe data is up-to-date data. Safe data is readily retrievable data. Safe data is more than a single point-in-time instance.

Unsafe data is “all your eggs in one basket.” Unsafe data is “I’ll get around to doing that backup tomorrow.” Unsafe data is “I stored the backups at my house which is also underwater now.” Unsafe data is “I only have yesterday’s backup and last week’s backup, and this data disappeared two days ago.”
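
That last case is worth spelling out. Here is a small Python sketch (the function name and the dates are invented for the example) that picks the newest backup taken before the data vanished:

```python
from datetime import datetime, timedelta

def best_restore_point(backup_times, loss_time):
    """Return the newest backup taken before the data was lost, or None."""
    candidates = [t for t in backup_times if t < loss_time]
    return max(candidates) if candidates else None

now = datetime(2008, 6, 4, 12, 0)   # arbitrary "today" for the example
backups = [now - timedelta(days=1), now - timedelta(days=7)]
loss = now - timedelta(days=2)

point = best_restore_point(backups, loss)
print("Restore from:", point)
print("Work lost between backup and deletion:", loss - point)
```

Yesterday’s backup was taken after the data disappeared, so it is no help. The only usable copy is last week’s, and five days of changes go with it. More recovery points in between would shrink that gap.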

SoftLayer’s customers are privileged to have the option to build a truly safe data backup strategy by employing the Evault option on StorageLayer. This solution provides instantaneous off-site backups and efficiently utilizes tight compression and block-level delta technologies, is fully automated, has an extremely flexible retention policy system permitting multiple tiers of recovery points-in-time, is always online via our very sophisticated private network for speedy recovery, and most importantly—is incredibly economical for the value it provides. To really pour on the industry-speak acronym soup, it gives the customer the tools for their BCP to provide a DR scenario with the fastest RTO with the best RPO that any CAB would approve because of its obvious TCR (Total Cost of Recovery). Ok, so I made that last one up… but if you don’t recover from data loss, what does it cost you?

On my personal server, I utilize this offering to protect more than 22 GB of data. It backs up my entire server daily, keeping no less than seven daily copies representing at least one week of data. It backs up my databases hourly, keeping no less than 72 hourly copies representing at least three days of data. It does all this seamlessly, in the background, and emails me when it is successful or if there is an issue.
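
A retention policy like that boils down to a simple rule: keep the newest N copies of each kind. The Python below is one simple reading of it, written for illustration; the function and its parameters are my own, not EVault’s configuration.

```python
from datetime import datetime, timedelta

def prune(backups, keep_hourly=72, keep_daily=7):
    """Split backups into (kept, dropped) under a count-based retention policy.

    `backups` is a list of (timestamp, kind) tuples, kind being "hourly" or "daily".
    """
    limits = {"hourly": keep_hourly, "daily": keep_daily}
    kept, dropped = [], []
    seen = {"hourly": 0, "daily": 0}
    # Walk newest-first so the most recent copies of each kind are the ones kept.
    for ts, kind in sorted(backups, reverse=True):
        if seen[kind] < limits[kind]:
            kept.append((ts, kind))
            seen[kind] += 1
        else:
            dropped.append((ts, kind))
    return kept, dropped

now = datetime(2008, 6, 4, 0, 0)
backups = [(now - timedelta(hours=h), "hourly") for h in range(100)]
backups += [(now - timedelta(days=d), "daily") for d in range(10)]

kept, dropped = prune(backups)
print(len(kept), "kept,", len(dropped), "dropped")
```

Out of 100 hourly and 10 daily copies, the newest 72 and 7 survive, which matches the three days of hourly and one week of daily recovery points described above.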

Most importantly, it keeps my data safe in Seattle, while my server is located in Dallas. Alternatively, if my server were located in Seattle, I could choose for my data to be stored in Dallas or our new Washington DC facility. Here’s the kicker, though. It provides me the ability to have this level of protection, with all the bells and whistles mentioned above, without overstepping the boundary of my 10 GB service. That’s right, I have 72 copies of my database and 7 copies of my server, of which the original data totals in excess of 22 GB, stored within 10 GB on the backup server.
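
How can 79 copies of 22 GB fit in 10 GB? The short answer is that unchanged blocks are not stored again. The Python below is a toy stand-in for that idea using content hashes; it is an assumption-laden sketch (the class, the tiny block size and the sample data are all invented), not EVault’s actual algorithm, and it leaves out compression entirely.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow

class BlockStore:
    """Stores each unique block once; a backup is just a list of block hashes."""

    def __init__(self):
        self.blocks = {}   # hash -> block bytes
        self.backups = []  # one list of hashes per backup

    def add_backup(self, data: bytes):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # only new blocks take space
            hashes.append(digest)
        self.backups.append(hashes)

    def restore(self, index: int) -> bytes:
        return b"".join(self.blocks[h] for h in self.backups[index])

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = BlockStore()
day1 = b"AAAABBBBCCCCDDDD"
day2 = b"AAAABBBBXXXXDDDD"   # only the third block changed
store.add_backup(day1)
store.add_backup(day2)

print("Logical size:", len(day1) + len(day2), "bytes")
print("Stored size: ", store.stored_bytes(), "bytes")
```

Both days can still be restored in full, yet the second backup only cost one extra block. The less your data changes between backups, the more copies fit in the same space.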

That’s more than sufficient for my needs, but I could retain weekly data or monthly data without significant increase in storage requirements, due to the nature of my dataset.

This service costs a mere $20/mo, or $240/yr. How much would you expect to pay to be able to sleep at night, knowing your data is safe?

Are you missing something? Wait … Backup!

-Justin

February 27, 2008

Hardwhere?

It’s a fact -- all software ends up relying on a piece of hardware at some point. And hardware can fail. But the secret is to create redundancy to minimize the impact if hardware does fail.

RAID, load balancers, redundant power supplies, cloud computing - the list goes on. And we support them all. Many of these options are not mandatory, but I wish they were! That’s where the customer comes in – it is critical to understand the value of the application and data sitting on the hardware and set a redundancy and recovery plan that fits.

Keep your DATA safe:

  • RAID - For starters *everyone* should have a RAID 1, 5, or 10. This keeps your server online in the event of a drive failure.

The best approach – RAID 10 all the way. You get the benefits of a RAID 0 (striping across 2 drives so you get the data almost twice as fast) and the security of RAID 1 (mirroring data on 2 separate drives) all rolled into one. I think every server should have this as a default.

  • Separate Backups – EVault Backup, iSCSI Storage, FTP/NAS Storage, your own NAS server or just a different server. Lose data just once (or have the ability to recover it painlessly) and these will pay for themselves. Remember, hardware is not the only way in which you can lose data -- hackers, software failures, and human error will always be a risk.

StorageLayer. Use it or lose it.
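
To make the RAID 10 idea above concrete, here is a small Python sketch. It assumes a four-drive array (two mirrored pairs with data striped across the pairs), and the drive numbering and function names are made up for the example.

```python
def raid10_locations(block, n_pairs=2):
    """Map a logical block to the two drives that hold it.

    Drives are numbered so that pair p consists of drives 2p and 2p + 1.
    Blocks are striped across pairs and mirrored within each pair.
    """
    pair = block % n_pairs
    return (2 * pair, 2 * pair + 1)

def survives(failed_drives, n_pairs=2):
    """The array survives as long as no mirror pair has lost both drives."""
    return all(
        not {2 * p, 2 * p + 1} <= set(failed_drives)
        for p in range(n_pairs)
    )

for b in range(4):
    print("block", b, "-> drives", raid10_locations(b))

print(survives([0]))       # one drive lost
print(survives([0, 3]))    # one drive lost from each pair
print(survives([0, 1]))    # both halves of the same pair lost
```

Any single drive can fail, and even two can fail if they sit in different pairs. Lose both halves of one pair, though, and you are reaching for those separate backups.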

Going further:

  • Redundant servers in different locations – spread your servers out across different datacenters and use a load balancer. Nothing is safer than a duplicate server 1000’s of miles away. That’s why we have invested in a second data center – to keep your data and business safe.

Check 'em out in our Services > Network Services section.

The future:

  • Solid state drives – aww yeah baby. They are coming.

Solid state drives are just that – a drive with no moving parts. No more platters or read/write heads. I mean come on, hard drives are essentially using the same basics that old record players use. CDs use this technology too. And you see where those went (can you say iPod? I prefer my iPod touch. I have never had an iPod until now so I skipped right to the new fancy pants model. Can you tell I just got it?).

Check out these comparison tests of solid state drives vs. conventional ones:

  • Faster, faster, faster! -- Processors, memory, drives, network -- everything is getting much faster. And in part through redundancy (dual and quad core processors, dual and quad processor motherboards). See? Redundancy is the way of the future!

We have 4 Intel Xeon Quadcore Tigertown processors on one motherboard. That’s 16 cores on one server! Shazam!

  • Robot DC patrol sharks – yep. Got the plans on my desk right now. But I can’t take all the credit, Josh R. suggested this one, I just make things happen.

I work to keep all of our hardware running in tip top condition. But I look at the bigger picture when it comes to hardware – how to completely eliminate the impact of any hardware issue. That’s why I suggest all the redundancies listed above. While I can reduce the probability of hardware issues with testing, monitoring of firmware updates, proper handling procedures, choosing quality components, etc., redundancy is the ultimate solution to invisible hardware.

Hardwhere?, if you will.

-Brad

February 11, 2008

Spares at the Ready

In Steve's last post he talked about the logic of outsourcing. The rationale included the cost of redundant internet connections, the cost of the server, UPS, small AC, etc. He covers a lot of good reasons to get the server out of the broom closet and into a real datacenter. However, I would like to add one more often overlooked component to that argument: the Spares Kit.

Let's say that you do purchase your own server and you set it up in the broom closet (or a real datacenter for that matter) and you get the necessary power, cooling and internet connectivity for it. What about spare parts?

If you lose a hard drive on that server, do you have a spare one available for replacement? Maybe so - that's a common part with mechanical features that is liable to fail - so you might have that covered. Not only do you have a spare drive, the server is configured with some level of RAID so you're probably well covered there.

What if that RAID card fails? It happens - and it happens with all different brands of cards.

What about RAM? Do you keep a spare RAM DIMM handy or if you see failures on one stick, do you just plan to remove it and run with less RAM until you can get more on site? The application might run slower because it's memory starved or because now your memory is not interleaved - but that might be a risk you are willing to take.

How about a power supply? Do you keep an extra one of those handy? Maybe you keep a spare. Or, you have dual power supplies. Are those power supplies plugged into separate power strips on separate circuits backed up by separate UPSs?

What if the NIC on the motherboard gets flaky or goes out completely? Do you keep a spare motherboard handy?

If you rely on out of band management of your server via an IPMI, Lights Out or DRAC card - what happens if that card goes bad while you're on vacation?

Even if you have all necessary spare parts for your server or you have multiple servers in a load balanced configuration inside the broom closet, what happens if you lose your switch or your load balancer or your router or your... What happens if that little AC you purchased shuts down on Friday night and the broom closet heats up all weekend until the server overheats? Do you have temperature sensors in the closet that are configured to send you an alert - so that now you have to drive back to the office to empty the water pail of the spot cooler?

You might think that some of these scenarios are a bit far-fetched, but I can certainly assure you that they're not. At SoftLayer, we have spares of everything. We maintain hundreds of servers in inventory at all times, we maintain a completely stocked inventory room full of critical components, and we staff it all 24/7 and back it all up with a 4 hour SLA.

Some people do have all of their bases covered. Some people are willing to take a chance, and even if you convince your employer that it's ok to take those chances, how do you think the boss will respond when something actually happens and critical services are offline?

-SamF

January 30, 2008

That's SMART

My grandmother used to say an ounce of prevention is worth a pound of cure. Usually this was her polite way of telling me to pick my skateboard up off the stairs before she stepped on it and broke her neck or to put a sheet of newspaper over her antique kitchen table before I began refueling my model airplane. All very sound advice looking back. And now here I find myself repeating the same adage some twenty years later in the context of predicting mechanical drive failure. An ounce of prevention is worth a pound of cure.

Hard disk drive manufacturers recognized both the reality and the advantages of being able to predict normal hard disk failures associated with drive degradation in the mid-1990s. This led a number of leading hard disk makers to collaborate on a standard which eventually became known as SMART. This acronym stands for Self-Monitoring, Analysis and Reporting Technology and when used properly is a formidable weapon in any system administrator's arsenal.

The basic concept is that firmware on the hard disk itself will record and report key "attributes" of that drive which when monitored and analyzed over time can be used to predict and avoid catastrophic hard disk failures. Anyone who has been around computers for more than a day knows the terrible feeling that manifests in the pit of your stomach when it becomes apparent that your server or workstation will not boot because the hard disk has cratered. Luckily, we ALL of course back up our hard drives daily! Right?

All kidding aside, even with a recent backup, just the task of restoring and getting your system back in working order is a serious hassle, and it’s not something you get the luxury of scheduling if the machine is critical to operations and failed in the middle of your work day or, worse yet, the middle of your beauty sleep. That is where SMART comes in. When properly used, SMART data can give “clues” that a drive is reaching a failure point--prior to it failing. This in turn means you can schedule a drive cloning and replacement within your next regular maintenance window. Really, aside from a hard disk that lasts forever, what more could an administrator ask for?

SMART drive data has been described as a jigsaw puzzle. That's because it takes monitoring a myriad of data points consistently over time to be able to put together a picture of your hard disk health. The idea is that an administrator regularly records and analyzes characteristics about the installed spinning media and looks for early warning signs that something is going wrong. While different drives have different data points, some of the key and most common attributes are:

  • head flying height
  • data throughput performance
  • spin-up time
  • re-allocated sector count
  • seek error rate
  • seek time performance
  • spin retry count
  • drive calibration retry count

These items are considered typical drive health indicators and should be base-lined at drive installation and then monitored for significant degradation. While the experts still disagree on the exact value of SMART data analysis, I have seen sources that claim at least 30% of drive failures can be detected some 60 days prior to the actual failure through the monitoring of SMART data.
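
The base-line-then-monitor idea can be sketched in a few lines of Python. Everything here is hypothetical: the attribute names, the readings and the 10% tolerance are invented for illustration, and real drives report vendor-specific values that need more careful interpretation than a single threshold.

```python
def degraded_attributes(baseline, current, tolerance=0.10):
    """Return attributes whose value moved more than `tolerance` from baseline.

    Both arguments map attribute name -> numeric reading. Attributes with a
    zero baseline are flagged on any increase, since a ratio is undefined.
    """
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # attribute not reported this time
        if base == 0:
            if now > 0:
                flagged[name] = (base, now)
        elif abs(now - base) / abs(base) > tolerance:
            flagged[name] = (base, now)
    return flagged

# Hypothetical readings; real values and their meaning vary by drive vendor.
baseline = {"spin_up_time_ms": 4000, "reallocated_sector_count": 0, "seek_error_rate": 50}
current = {"spin_up_time_ms": 4100, "reallocated_sector_count": 12, "seek_error_rate": 80}

for name, (base, now) in degraded_attributes(baseline, current).items():
    print(f"{name}: baseline {base}, now {now}")
```

In this made-up sample the spin-up time has barely moved, but the re-allocated sector count and seek error rate have drifted well away from where they started. That kind of drift is the cue to schedule a replacement.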

Of course not all drive failures can be predicted. Plus some failures are caused by factors other than drive degradation. Consider drives damaged by power surges or drives that are dropped in shipping as good examples of drive failures that cannot normally be detected through SMART monitoring. However in my humble opinion even one hard disk failure prevented over the course of my career is something to celebrate--unless you happen to own stock in McNeil Consumer Healthcare, a.k.a. the distributors of Tylenol!

So what does this have to do with SoftLayer? Well I am certainly not claiming that SoftLayer is going to predict all your hard drive disasters so there is no reason for you to back up your data. In fact, I recommend not just backing it up but backing it up in geographically disparate locations (did I mention we have data centers in Dallas and Seattle?). What I do mean to share is that technologies like SMART data are just one of the many ways SoftLayer is currently investigating to improve what is already the best hosting company in the business.

I should know. I was tasked with writing the low-level software to extract this data. That’s right. SoftLayer has engineers working at the application layer, down at the device driver layer, and everywhere in between. If that doesn’t give you a warm fuzzy about your hosting company, I don’t know what will.

-William
