Posts Tagged 'Downtime'

May 18, 2011

Panopta: Tech Partner Spotlight

This is a guest blog from Jason Abate of Panopta, a SoftLayer Tech Marketplace Partner specializing in server monitoring and outage management, with tools and resources designed to help minimize the impact of outages on your online business.

5 Server Monitoring Best Practices

Prior to starting Panopta, I was responsible for the technology and operations side of a major international hosting company and worked with a number of large online businesses. During this time, I saw my share of major disasters and near catastrophes and had a chance to study what works and what doesn't when Murphy's Law inevitably hits.

Monitoring is a key component of any serious online infrastructure, and there are a wide range of options when it comes to monitoring tools — from commercial and open-source software that you install and manage locally to monitoring services like Panopta. The best solution depends on a number of criteria, but there are five major factors to consider when making this decision.

1. Get the Most Accurate View of Your Infrastructure
Accuracy is a double-edged sword when it comes to monitoring; it can hurt you in two different ways. Check too infrequently and you'll miss outages entirely, thinking everything is rosy while your customers or visitors are actually encountering problems. There are tools that check every 30 minutes or more, but these are useless for real production sites. Make sure you can perform a complete check of your systems every 60 seconds so that small problems aren't overlooked.

I've seen many people set up this high-resolution monitoring only to be hit with a barrage of alerts for frequent, short-lived problems that previously went undetected. It may hurt to discover these, but at least with information about the problem you can fix it once and for all.

The flip side of accuracy is that your monitoring system needs to verify that outages are real before alerting, to avoid false alarms. There's no faster way to train an operations team to ignore the monitoring system than with false alerts. You want your team to jump when alerts come in.

High-frequency checks that are confirmed from multiple physical locations will ensure you get the most accurate view of your infrastructure possible.
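To make the idea concrete, here is a minimal sketch of a check loop that polls a URL every 60 seconds and only alerts after the failure is confirmed by a second check. This is not Panopta's implementation; the URL, interval, timeout and the simple confirm-by-retry step are all illustrative assumptions (a real service would confirm from a separate physical location).

# Rough sketch: check a URL every 60 seconds and only alert when the
# failure is confirmed by a second check. In a real monitoring service
# the confirming check would run from a different physical location;
# here it is simulated with a retry after a short pause.
import time
import urllib.request
import urllib.error

CHECK_URL = "https://www.example.com/"   # hypothetical site to watch
CHECK_INTERVAL = 60                      # seconds between checks
CONFIRM_DELAY = 5                        # pause before the confirming check

def is_up(url, timeout=10):
    """Return True if the URL answers with an HTTP 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def alert(message):
    """Placeholder for whatever paging/email/SMS integration you use."""
    print("ALERT:", message)

if __name__ == "__main__":
    while True:
        if not is_up(CHECK_URL):
            # Confirm before alerting to avoid training the team
            # to ignore false alarms.
            time.sleep(CONFIRM_DELAY)
            if not is_up(CHECK_URL):
                alert(f"{CHECK_URL} failed two consecutive checks")
        time.sleep(CHECK_INTERVAL)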

2. Monitor Every Component of Your Infrastructure
There are lots of components that make up a modern website or application, and any of them could break at any time. You need to make sure that you're watching all of these pieces, whether they're inside your firewall or outside. Lots of monitoring providers focus purely on remotely accessible network services, which are important but only half of the picture. You also want an inside view of how your server's resources are being consumed, and how internal-only network devices (such as backend database servers) are performing.

Completeness also means that it's economically feasible to watch everything. If the pricing structure of your monitoring tool makes it cost-prohibitive to watch everything, the value of your monitoring setup is greatly diminished. The last thing you want when troubleshooting a complex problem is to discover that you have no data about one crucial server because you weren't monitoring it.

Make sure your monitoring system is able to handle all of your server and network components and gives you a complete view of your infrastructure.

3. Notify the Right People at the Right Time
You know that when the pager beeps or the phone rings about an outage, your heart beats a little faster. Of course, it's usually in the middle of the night while you're sleeping, right? As painful as it may be, you want your monitoring system to get you up when things are really hitting the fan - it's still better than hearing from angry customers (and bosses!) the next morning.

However, not all outages are created equal, and you may not want to be woken up when one of your clustered web servers briefly goes down and then corrects itself a few minutes later. The key to a successful monitoring solution is plenty of flexibility in your notification setup, including the ability to set up different notification types based on the criticality of the service.

You also want to be able to escalate, bringing in additional resources for long-running problems. That way outages don't go unnoticed for hours while the on-call admin who perpetually sleeps through pages gets more shut-eye.

Make sure that when it comes to notification, your monitoring system is able to work with your team's preferred setup, not the other way around.
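As a rough illustration of what "flexibility" means in practice, here is a sketch of a tiered escalation policy: who gets contacted depends on the criticality of the service and how long the outage has lasted. The tier names, contact labels and thresholds are made up for this example, not a real product configuration.

# Sketch of a tiered notification policy. Names and thresholds are
# illustrative assumptions, not a real API or configuration format.
ESCALATION_POLICY = {
    # criticality: [(minutes the outage has lasted, contacts to notify)]
    "critical": [(0, ["oncall-sms", "oncall-phone"]),
                 (15, ["team-lead-phone"]),
                 (60, ["ops-manager-phone"])],
    "standard": [(0, ["oncall-email"]),
                 (30, ["oncall-sms"])],
    "low":      [(0, ["ops-ticket-queue"])],
}

def contacts_for(criticality, minutes_down):
    """Return everyone who should have been notified by this point."""
    notified = []
    for threshold, contacts in ESCALATION_POLICY.get(criticality, []):
        if minutes_down >= threshold:
            notified.extend(contacts)
    return notified

# A brief web server blip doesn't wake anyone up, but a critical
# outage that runs long escalates past the on-call admin.
print(contacts_for("standard", 5))    # ['oncall-email']
print(contacts_for("critical", 70))   # full escalation chain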

4. Don't Just Detect Problems, Streamline Fixing Them
Sending out alerts about a problem is important, but it's just the first step in getting things back to normal. Ideally after being alerted an admin can jump in and solve whatever the problem is and life goes on. All too often though, things don't go this smoothly.

You've probably run into situations where an on-call admin is up most of the night with a problem. That's great, but when the rest of the team comes in the next morning they have no idea what was done. What if the problem comes up again? Are there important updates that need to be deployed to other servers?

Or maybe you have a big problem that attracts interest from your call center and support staff (your monitoring system did alert you before they walked up, right?). Or managers from other departments interrupt to get updates on the problem so they can head off a possible PR disaster.

These interruptions matter to the operation of your business, but they pull administrators away from actually solving the problem, which just makes things worse. There should be a better way to handle these situations. Given its central role in your infrastructure management, your monitoring system is in a great position to help streamline the problem-solving process.

Make sure your monitoring system gives you tools to keep everyone on the same page by letting everyone easily communicate and log what was ultimately done to resolve the problem.

5. Demonstrate How Your Infrastructure Is Performing
Your role as an administrator is to keep your infrastructure up and running. It's unfortunately a tough spot to be in - do your job really well and no one notices. But mess up, and it's clearly visible to everyone.

Solid reporting capabilities in your monitoring system give you a tool to help balance this situation. Be sure to get summary reports that demonstrate how well things are running, or that make the case for changes and then follow up to show progress. Availability reports also give you a "big picture" view of how your infrastructure is performing that often gets lost in the chaos of day-to-day operations.
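At its core, an availability report boils down to simple math: uptime as a percentage of the reporting period. A quick sketch, with purely illustrative numbers:

# Uptime percentage over a period, given total minutes of recorded downtime.
# The 30-day period and 22 minutes of downtime are illustrative figures.
def availability(period_minutes, downtime_minutes):
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes

month_minutes = 30 * 24 * 60                        # 43,200 minutes in a 30-day month
print(f"{availability(month_minutes, 22):.3f}%")    # ~99.949% with 22 minutes down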

Detailed reporting gives you the data you need to accurately assess and promote the health of your infrastructure.

The Panopta Difference
There are quite a few options available for monitoring your servers, each of which comes with trade-offs. We've designed Panopta to focus on these five criteria, and having built on top of SoftLayer's infrastructure from the very beginning, we're excited to be a part of the SoftLayer Technology Marketplace.

I would encourage you to try out Panopta and other solutions and see which is the best fit for the specific requirements of your infrastructure and your team - you'll appreciate what a good night's sleep feels like when you don't have to worry about whether your infrastructure is up and running.

-Jason Abate, Panopta

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace.
These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.
March 7, 2011

March Madness - Customer Experience Style

If you are a SoftLayer customer, you probably noticed a maintenance window early Sunday morning. If you aren't a SoftLayer customer (you should be), you may have noticed on quite a few social media outlets that we were providing real-time updates on the maintenance progress, and that our customers were doing so as well.

SoftLayer customers who would be affected were given two tickets notifying them of the maintenance, and when those tickets were created, the ticket system sent an email to the admin user on each account. Additionally, our portal notification system was updated to show details about the window, and we created new threads in our customer forums to provide regular, centralized updates. We went as far as taking a few calls and meetings with customers to talk through their concerns about the maintenance timing and length, because we know that any downtime is bad downtime in the world of hosting.

Saturday night, we had extra support staff online, and our social media ninja was awake, letting the world know step by step what we were doing with real-time status alerts. We wanted to be extremely transparent during the entire process. This was not a maintenance we could avoid, and we rolled as much of the work that needed to be done into this window as we could without making a rollback impossible.

The maintenance itself went well, and as planned, most items that were taken down were back online well before the window ended. We ran into a few snags in bringing all of the CloudLayer CCIs back online, but even with those delays for a few customers, the work was completed by the time we committed to.

Now for the customer experience aspect. From reading various tweets from our customers, it seems we could (and should) have done a few things even better: been more proactive, sent standard email, attempted phone calls, etc.

While some of these options may be considered, not all are feasible. If you are one of the customers who tweeted or blogged, who plans to tweet or blog, or who believes we're being anything less than genuine and transparent on our social media platforms, I want to hear from you.

Please comment on this blog, tweet me @skinman454, email me at skinman@softlayer.com, call me at 214.442.0592, or come by our office and visit.

Whatever it takes, just contact me. I can't put myself in your shoes and feel your pain on things like this unless we have a chance to talk about it. I look forward to our conversation.

-Skinman

June 18, 2008

Planning for Data Center Disasters Doesn’t Have to Cost a Lot of $$

One of the hot topics over the past couple of weeks in our growing industry has been how to minimize downtime should your (or your host’s) data center experience catastrophic failure leading to outages that could span multiple days.

Some will think that it is the host’s responsibility to essentially maintain a spare data center into which they can migrate customers in case of catastrophe. The reason we don’t do this is simple economics. To maintain this type of redundancy, we’d need to charge you at least double our current rates. Because costs begin jumping exponentially instead of linearly as extensive redundancy is added, we’d likely need to charge you more than double our current rates. You know what? Nobody would buy at that point. It would be above the “reservation price” of the market. Go check your old Econ 101 notes for more details.

Given this economic reality, we at SoftLayer provide the infrastructure and tools for you to recover quickly from a catastrophe with minimal cost and downtime. But, every customer must determine which tools to use and build a plan that suits the needs of the business.

One way to do this is to maintain a hot-synched copy of your server at a second of our three geographically diverse locations. Should catastrophe happen to the location of your server, you will stay up and have no downtime. Many of you do this already, even keeping servers at multiple hosts. According to our customer surveys, 61% of our customers use multiple providers for exactly that reason – to minimize business risk.

Now I know what you’re thinking – “why should I maintain double redundancy and double my costs if you won’t do it?” Believe me, I understand this - I realize that your profit margins may not be able to handle a doubling of your costs. That is why SoftLayer provides the infrastructure and tools for an affordable alternative to running duplicate infrastructure in multiple locations in case of catastrophe.

SoftLayer’s eVault offering can be a great, cost-effective alternative to placing servers in multiple locations. Justin Scott has already blogged about the rich backup features of eVault and how his backup data is in Seattle while his server is in Dallas, so I won’t restate what he has already said. I will add that eVault is available in each of our data centers, so no matter where your server is at SoftLayer, you can work with your sales rep to have your eVault backups stored in a different location. Thus, for prices that are WAY lower than an extra server (eVault starts at $20/month), you can keep near-real-time backups of your server data off site. And because the data transfer between locations happens on SoftLayer’s private network, your data is secure and the transfer doesn’t count toward your bandwidth allotment.

So let’s say your server is in our new Washington DC data center and your eVault backups are kept in one of our Dallas data centers. A terrorist group decides to bomb data centers in the Washington DC area in an attempt to cripple US government infrastructure and our facility is affected and won’t be back up for several days. At this point, you can order a server in Dallas, and once it is provisioned in an hour or so, you restore the eVault backup of your choice, wait on DNS to propagate based on TTL, and you’re rolling again.
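For a rough sense of what that downtime looks like, here is a back-of-the-envelope estimate for the restore-from-eVault scenario above. The provisioning hour comes from the scenario; the backup size, restore throughput and DNS TTL are assumptions for illustration, and your numbers will differ.

# Back-of-the-envelope downtime estimate for the restore scenario.
# All figures below are assumptions for illustration only.
provision_minutes  = 60    # new server online "in an hour or so"
backup_size_gb     = 40    # assumed size of the eVault restore
restore_mb_per_sec = 50    # assumed restore throughput
dns_ttl_minutes    = 30    # assumed TTL on the affected DNS records

restore_minutes = (backup_size_gb * 1024) / restore_mb_per_sec / 60
total_minutes = provision_minutes + restore_minutes + dns_ttl_minutes

print(f"Restore: ~{restore_minutes:.0f} min")
print(f"Estimated total downtime: ~{total_minutes:.0f} min")
# With these assumptions: roughly 60 + 14 + 30, or about 104 minutes.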

Granted, you do experience some downtime with this recovery strategy. But the trade-off is that you are up and running smoothly after that brief outage, for a contingency cost that begins at only $20 per month. And when you factor in the SLA credit on the destroyed server, which offsets the cost of ordering a new one, your eVault fee is the only cost of this recovery plan.

This is much less than doubling your costs with offsite servers to almost guarantee no downtime. The reason that I throw in the word “almost” is that if an asteroid storm takes out all of our locations and your other providers’ locations, you will experience downtime. Significant downtime.

-Gary
