
January 30, 2008

That's SMART

My grandmother used to say an ounce of prevention is worth a pound of cure. Usually this was her polite way of telling me to pick my skateboard up off the stairs before she stepped on it and broke her neck or to put a sheet of newspaper over her antique kitchen table before I began refueling my model airplane. All very sound advice looking back. And now here I find myself repeating the same adage some twenty years later in the context of predicting mechanical drive failure. An ounce of prevention is worth a pound of cure.

Hard disk drive manufacturers recognized both the reality and the advantages of being able to predict normal hard disk failures associated with drive degradation back in the mid-1990s. This led a number of leading hard disk makers to collaborate on a standard which eventually became known as SMART. The acronym stands for Self-Monitoring, Analysis and Reporting Technology, and when used properly it is a formidable weapon in any system administrator's arsenal.

The basic concept is that firmware on the hard disk itself records and reports key "attributes" of that drive, which, when monitored and analyzed over time, can be used to predict and avoid catastrophic hard disk failures. Anyone who has been around computers for more than a day knows the terrible feeling that manifests in the pit of your stomach when it becomes apparent that your server or workstation will not boot because the hard disk has cratered. Luckily, we ALL of course back up our hard drives daily! Right?

All kidding aside, even with a recent backup, just the task of restoring and getting your system back in working order is a serious hassle. It's not something you get the luxury of scheduling if the machine is critical to operations and fails in the middle of your work day or, worse yet, the middle of your beauty sleep. That is where SMART comes in. When properly used, SMART data can give “clues” that a drive is reaching a failure point before it actually fails. This in turn means you can schedule a drive cloning and replacement within your next regular maintenance window. Really, aside from a hard disk that lasts forever, what more could an administrator ask for?

SMART drive data has been described as a jigsaw puzzle. That's because it takes monitoring a myriad of data points consistently over time to be able to put together a picture of your hard disk health. The idea is that an administrator regularly records and analyzes characteristics about the installed spinning media and looks for early warning signs that something is going wrong. While different drives have different data points, some of the key and most common attributes are:

  • head flying height
  • data throughput performance
  • spin-up time
  • re-allocated sector count
  • seek error rate
  • seek time performance
  • spin retry count
  • drive calibration retry count
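To make "monitoring consistently over time" a little more concrete, here is a minimal Python sketch of keeping dated readings for one attribute and computing how fast its raw count is growing. The `Reading` class, the `growth_per_day` function, and the attribute name are all hypothetical names made up for illustration; this is not how any particular tool stores SMART data, and the numbers in the usage example are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Reading:
    """One recorded sample of a single attribute (hypothetical structure)."""
    day: date
    attribute: str
    raw_value: int


def growth_per_day(readings, attribute):
    """Average daily change in raw value between the first and last reading."""
    series = sorted(
        (r for r in readings if r.attribute == attribute),
        key=lambda r: r.day,
    )
    if len(series) < 2:
        return 0.0  # not enough history to see a trend
    elapsed_days = (series[-1].day - series[0].day).days
    if elapsed_days == 0:
        return 0.0  # both readings taken on the same day
    return (series[-1].raw_value - series[0].raw_value) / elapsed_days


# Illustrative, invented readings: the count climbs from 0 to 20 in ten days.
history = [
    Reading(date(2008, 1, 1), "reallocated_sector_count", 0),
    Reading(date(2008, 1, 11), "reallocated_sector_count", 20),
]
rate = growth_per_day(history, "reallocated_sector_count")
```

A count that sits still for months and then starts climbing is the kind of puzzle piece worth a closer look, though what rate counts as alarming is a judgment call that depends on the drive.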

These items are considered typical drive health indicators and should be base-lined at drive installation and then monitored for significant degradation. While the experts still disagree on the exact value of SMART data analysis, I have seen sources that claim at least 30% of drive failures can be detected some 60 days prior to the actual failure through the monitoring of SMART data.
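As a rough illustration of baselining, the Python sketch below compares a snapshot taken at installation with a later one and lists the attributes that have drifted by more than a chosen tolerance. The function name `degraded_attributes`, the attribute keys, the 10% tolerance, and every number shown are assumptions picked for the example, not values from any drive vendor or from SoftLayer's software.

```python
def degraded_attributes(baseline, current, tolerance=0.10):
    """Return the names of attributes that drifted more than `tolerance`
    (as a fraction of the baseline value) since the baseline snapshot."""
    flagged = []
    for name, base in baseline.items():
        if name not in current:
            continue  # attribute not reported this time; nothing to compare
        now = current[name]
        if base == 0:
            # A counter that started at zero is flagged on any change.
            changed = now != 0
        else:
            changed = abs(now - base) / abs(base) > tolerance
        if changed:
            flagged.append(name)
    return sorted(flagged)


# Invented snapshots for illustration only.
at_install = {
    "spin_up_time": 4000,
    "reallocated_sector_count": 0,
    "seek_error_rate": 100,
}
today = {
    "spin_up_time": 4600,
    "reallocated_sector_count": 8,
    "seek_error_rate": 101,
}
suspects = degraded_attributes(at_install, today)
```

This sketch flags movement in either direction. In practice, whether a rising or a falling number is the bad sign depends on the attribute, so a real monitor would likely need per-attribute rules.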

Of course, not all drive failures can be predicted, and some failures are caused by factors other than drive degradation. Drives damaged by power surges or dropped in shipping are good examples of failures that cannot normally be detected through SMART monitoring. However, in my humble opinion, even one hard disk failure prevented over the course of my career is something to celebrate, unless you happen to own stock in McNeil Consumer Healthcare, a.k.a. the distributors of Tylenol!

So what does this have to do with SoftLayer? Well, I am certainly not claiming that SoftLayer is going to predict all your hard drive disasters and that you therefore have no reason to back up your data. In fact, I recommend not just backing it up but backing it up in geographically disparate locations (did I mention we have data centers in Dallas and Seattle?). What I do mean to share is that technologies like SMART data are just one of the many avenues SoftLayer is currently investigating to improve what is already the best hosting company in the business.

I should know. I was tasked with writing the low-level software to extract this data. That’s right. SoftLayer has engineers working at the application layer, down at the device driver layer, and everywhere in between. If that doesn’t give you a warm fuzzy about your hosting company, I don’t know what will.

-William
