Posts Tagged 'Problems'

July 25, 2012

ServerDensity: Tech Partner Spotlight

We invite each of our featured SoftLayer Tech Marketplace Partners to contribute a guest post to the SoftLayer Blog, and this week, we're happy to welcome David Mytton, Founder of Server Density. Server Density is a hosted server and website monitoring service that alerts you when your website is slow, down or back up.

5 Ways to Minimize Downtime During Summer Vacation

It's a fact of life that everything runs smoothly until you're out of contact, away from the Internet or on holiday. However, you can't be available 24/7 on the chance that something breaks; instead, there are several things you can do to ensure that when things go wrong, the problem can be managed and resolved quickly. To help you set up your own "get back up" plan, we've come up with a checklist of the top five things you can do to prepare for an ill-timed issue.

1. Monitoring

How will you know when things break? Using a tool like Server Density — which combines availability monitoring from locations around the world with internal server metrics like disk usage, Apache and MySQL — means that you can be alerted if your site goes down, and have the data to find out why.

Surprisingly, some of the most common problems we see are also the easiest to fix. One that happens all too often is a customer simply running out of disk space on a volume! If you've ever had it happen to you, you know that running out of space will break things in strange ways, whether it prevents the database from accepting writes or fails to store web sessions on disk. By doing something as simple as setting an alert for when used disk space on any important volume (not just root) reaches around 75%, you'll have proactive visibility into your server and can avoid hitting volume capacity.
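As a rough illustration of that kind of disk alert, here is a minimal Python sketch. It is not how Server Density implements its checks; the function names and the list of mount points are assumptions made up for this example, and only the 75% threshold comes from the advice above.

```python
import shutil

# Threshold taken from the advice above; the names here are hypothetical.
ALERT_THRESHOLD_PERCENT = 75.0


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def volumes_over_threshold(paths, threshold=ALERT_THRESHOLD_PERCENT):
    """Return (path, percent_used) pairs for volumes at or above the threshold."""
    over = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        pct = percent_used(usage.total, usage.used)
        if pct >= threshold:
            over.append((path, round(pct, 1)))
    return over


if __name__ == "__main__":
    # List every important mount point here, not just root (assumed paths).
    for path, pct in volumes_over_threshold(["/"]):
        print(f"ALERT: {path} is {pct}% full")
```

Run from cron or a similar scheduler, a script like this could feed whatever notification channel you already use.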

Additionally, you should define triggers for unusual values that will set off a red flag for you. For example, if your Apache requests per second suddenly drop significantly, that change could indicate a problem somewhere else in your infrastructure, and if you're not monitoring those indirect triggers, you may not learn about those other problems as quickly as you'd like. Find measurable direct and indirect relationships that can give you this kind of early warning, and find a way to measure them and alert yourself when something changes.
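One simple way to express such an indirect trigger is to compare the latest reading against a recent average. The sketch below assumes you already collect a metric such as Apache requests per second; the function name, the 50% drop fraction and the minimum sample count are illustrative assumptions, not values from any particular monitoring product.

```python
from statistics import mean


def is_sudden_drop(history, current, drop_fraction=0.5, min_samples=5):
    """Return True if `current` fell below a fraction of the recent average.

    history: recent readings of the metric (e.g. requests per second).
    drop_fraction: alert when current < baseline * drop_fraction (assumed 0.5).
    """
    if len(history) < min_samples:
        return False  # not enough data to form a baseline yet
    baseline = mean(history)
    if baseline <= 0:
        return False  # nothing to drop from
    return current < baseline * drop_fraction
```

With a history of `[100, 110, 90, 105, 95]` the baseline is 100, so a reading of 30 would trigger while a reading of 80 would not.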

2. Dealing with Alerts

It's no good having alerts sent to someone who isn't responding (or who can't at a given time). Using a service like PagerDuty allows you to define on-call rotations for different types of alerts. Nobody wants to be on-call every hour of every day, so differentiating and channeling alerts in an automated way could save you a lot of hassle. Another huge benefit of a platform like PagerDuty is that it also handles escalations: If the first contact in the path doesn't wake up or is out of service, someone else gets notified quickly.
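The escalation idea itself is easy to picture in code. The following is a generic Python sketch of walking an escalation path, not PagerDuty's API; the `Contact` class, the `available` flag and the `acknowledge` callback are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    available: bool  # hypothetical stand-in for "answered the page"


def escalate(alert, path, acknowledge):
    """Notify contacts in order and return the first one who acknowledges.

    path: ordered list of Contact objects (first entry is the primary on-call).
    acknowledge: callable(contact, alert) -> bool, e.g. a page with a timeout.
    Returns None if nobody in the path acknowledges the alert.
    """
    for contact in path:
        if acknowledge(contact, alert):
            return contact
    return None
```

In a real rotation the `acknowledge` step would send a notification and wait a fixed time for a response before moving on to the next contact.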

3. Tracking Incidents

Whether you're the only person responsible or you have a team of engineers, you'll want to track the status of alerts/issues, particularly if they require escalation to different vendors. If an incident lasts a long time, you'll want to be able to hand it off to another person in your organization with all of the information they need. By tracking incidents with detailed notes, you can avoid fatigue and prevent unnecessary repetition of troubleshooting steps.

We use JIRA for this because it allows you to define workflows an issue can progress along as you work on it. It also includes easy access to custom fields (e.g. specifying a vendor ticket ID) and can be assigned to different people.

4. Understanding What Happened

After you have received an alert, acknowledged it and started tracking the incident, it's time to start investigating. Often, this involves looking at logs, and if you only have one or two servers, it's relatively easy, but as soon as you add more, the process can get exponentially more difficult.

We recommend piping them all into a log search tool like (fellow Tech Partners Marketplace participant) Papertrail or Loggly. Those platforms afford you access to all of your logs from a single interface with the ability to see incoming lines in real-time or the functionality to search back to when the incident began (since you've clearly monitored and tracked all of that information in the first three steps).
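To make the "search back to when the incident began" step concrete, here is a small Python sketch that filters log entries gathered from several servers. It does not use the Papertrail or Loggly APIs; the `(timestamp, server, message)` tuple layout and the function name are assumptions for illustration only.

```python
from datetime import datetime


def search_logs(entries, since, keyword=""):
    """Return entries at or after `since` whose message contains `keyword`.

    entries: iterable of (timestamp, server, message) tuples from all servers.
    since: datetime marking when the incident began.
    The result is sorted by timestamp so lines from different servers interleave.
    """
    needle = keyword.lower()
    matches = [
        entry for entry in entries
        if entry[0] >= since and needle in entry[2].lower()
    ]
    return sorted(matches, key=lambda entry: entry[0])
```

Calling it with the incident start time recorded in your tracker and a keyword such as "error" gives a single time-ordered view across machines.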

5. Getting Access to Your Servers

If you're traveling internationally, access to the Internet via a free hotspot like the ones you find in Starbucks isn't always possible. It's always a great idea to order a portable 3G hotspot in advance of a trip. You can usually pick one up from the airport to get basic Internet access without paying ridiculous roaming charges. Once you have your connection, the next step is to make sure you can access your servers.

Both iPhone and Android have SSH and remote desktop apps available which allow you to quickly log into your servers to fix easy problems. Having those tools often saves a lot of time if you don't have access to your laptop, but they also introduce a security concern: If you open server logins to the world so you can log in from the dynamic IPs that change when you use mobile connectivity, then it's worth considering a multi-factor authentication layer. We use Duo Security for several reasons, with one major differentiator being the modules they have available for all major server operating systems to lock down our logins even further.

You're never going to escape the reality of system administration: If your server has a problem, you need to fix it. What you can get away from is the uncertainty of not having a clearly defined process for responding to issues when they arise.

-David Mytton, ServerDensity

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace. These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.

March 31, 2010

I Am the Cell Phone Person

Being the “cell phone person” here at SoftLayer has its challenges, to put it mildly. I thought that working with mostly boys (yes, I meant to say boys) would be a breeze compared to a bunch of women (we tend to be a bit ummm, picky?). I was terribly wrong! They are WORSE! Especially with gadgets like cell phones, considering the field we are in. For some reason a lot of them think that because they can configure a server they also know exactly what is wrong with their phone without actually troubleshooting it at all or why they MUST have this phone or that phone.

Reboot?! Why?! Hmmmm that was one of the first things I learned to ALWAYS do. I learned this from Jacob Linscott, my first IT guy back in 1997, who I work with once again; he is our Director of IT – Linux. I learned very quickly that I had better not EVEN think about calling him until I had rebooted my computer. Amazingly enough, I’d say the odds on a reboot fixing the issue with both computers and cell phones is very high, but that’s about the only thing that is similar in regards to issues between the two. I have been amazed at the multitude of varying issues as well as the information you can find online to fix a phone without having to call the carrier; and, that is a real life saver!

What baffles me is that everyone seems to know what’s wrong with their phone without actually researching it. When I say “So you Googled that and found info that said it was most likely the issue?” I get “nah, I just think that’s it.” I just shake my head, take their phone, and walk away. I Google my rear end off all the time! I am as specific as possible when I do a search. Such as, “my 8320 can send SMS, but is not receiving them.” Seems obvious, right? Wrong!

One would think the Geektopia of staff we have would do the same. WRONG! There is a world of knowledge and information out there regarding any number of BlackBerry and iPhone issues if you just take a few minutes to type your issue into a search engine. Heck, you don’t have to use Google; you can use whatever search engine you want! I’ve sent out emails regarding tips and tricks; the problem I seem to have is getting people to actually read the info. Admittedly, we get hundreds and hundreds of emails a day, some days thousands, depending on what group lists they are on, so I’ll give a little slack. It’s simply a case of missing the obvious: just as when you are trying to fix a computer that won’t work and the cause turns out to be the simplest thing that was forgotten, the same happens with phone issues too. Everyone just goes into panic mode when their phone isn’t functioning. Amazing how we lived without cell phones just 20 years ago.

When SL was starting up just a few years ago, our VP of Sales was the cell phone person and he wasn’t too thrilled. He couldn’t WAIT to pass it on to someone else. I was the chosen one or sucker, depending how you look at it. I remember sitting in my cube my first week at SL, which wasn’t too far from his office, and giggling when he had to call the carrier and deal with some phone issues. I don’t giggle anymore. They told me by no means was it a punishment, taking over this particular job duty, but some days I wonder—especially the days when I get stuck on the phone for hours and hours trying to get a phone fixed, repeating myself over and over to 5 different people in 5 departments! It’s a source of some major meltdowns to say the least.

You see, we have about 130 phones throughout the company in four different locations: Dallas has Corporate and the DC, and then of course there are Seattle and WDC. So a lot of phones, a lot of folks, a lot of issues; from “My phone got ruined when I went hiking wearing khakis and got caught in a rain storm, the rain soaked through and ruined my phone, can I have a better one now?” to, “I lost it at the Christmas Party, sorry” to “If I step on it, does that mean I have to pay for it, because I want a better one?!” Yes, those are just a few of them, and obviously some of my favorites.

I, with the help of a few others, just recently upgraded 31 phones; Lance, our CEO, is cool like that. You see, the 31 were 8700c BB models, fondly referred to as “coasters” around here. Of course they were spread across our four locations, so this required lots of coordination with someone on the other end of the line. This upgrade took over a month due to device issues (the phone was new to market at the time).

The guys in the Dallas NOC all know better than to laugh as they hear my cursing due to being on the phone for countless hours; or if they do, they’ve gotten much better about hiding it. The point of all of this is to remind you that if you have a company cell phone and it has issues, be kind to your cell phone person and know that you are not the only one with an issue. Cell phones break. Cell phones die. Cell phones get dropped on the ground, in the toilet, or, my favorite, thrown across a room in anger every single day. So if your cell phone person can’t get to you RIGHT THAT MINUTE, try troubleshooting it yourself. No, not installing things, but maybe just try and look up your issue, and let them know what you found. Send them the link or print it out. It will make their day. Trust me on this one!

January 12, 2010

SLXXXXX Twitter Log

8/24/2009 1:00PM – Just ordered 3 more servers from SL. Man I love how easy it is to order, and the provisioning time is incredible.

8/24/2009 11:45PM – Got the new servers setup; now I have redundancy for my app. G’nite.

9/04/2009 8:00AM – Suhweet, just passed 50K users for my app. Hitting the pool.

9/21/2009 6:42PM – Oops, app crashed too many users. Recovering now. Thank goodness for monitoring alerts.

9/21/2009 8:13PM – Sorry all, app back up. SL CloudLayer really helped. Their portal makes it all easy.

9/22/2009 3:13AM – Ok stayed up late tonight and added new functionality to the app and added a new app server, geographic load balancing baby!

10/6/2009 2:45PM – Thanks for all the support on the app, keep the new ideas coming. 450K users and growing.

10/31/2009 5:50PM – Happy Halloween! 627K users. Thank you!!

11/14/2009 6:02AM – Getting close 989K users. Party at 1 Million. Just added 2 new front end servers in each DC, adding cloud storage now for Data replication/protection.

11/21/2009 7:31AM – It’s finally here 1 Mil. Party time! Isn’t ad revenue the greatest. The in game pay to play money is fun too. Thanks all!

12/10/2009 4:42PM – Still growing. I was alerted that one server crashed. No users affected. Technology is cool.

12/18/2009 9:16PM – ‘Bout to go silent for the Holidays. Hope you all have good ones. See you at 1.5 million when I return.

12/19/2009 7:00AM – Decided to add a couple more cloud instances for good measure. App is smoking fast.

12/31/2009 10:45PM – Monitoring just hit my phone, at party will check asap.

12/31/2009 11:00PM – Found a netbook at the party. App is crashed. Looking.

12/31/2009 11:07PM – WT? All servers down, hard down. SL up and friend app good on SL network. Investigating, sorry for outage.

12/31/2009 11:10PM – Hackers? Not sure all servers affected. Ping only. Had very secure. No problem before.

12/31/2009 11:29PM – Portal password got hacked. Intruders OS reloaded every server with RedHat, turned off all CCI.

1/04/2010 6:00AM – Happy New Year, mine sucked – app back – 5000 daily users. Sad day.

While the above is completely fictional, it could happen to just about anyone. Don’t let it happen to you. No matter how long and how secure you think your password is, there is someone out there who can crack it. It is one thing keeping a server secure and most technical geniuses are very adept at doing just that. With all the time and effort it takes to keep your servers secure, you might find that you have slipped in other areas. SoftLayer is here to help in VIP Style.

The cutting edge SoftLayer portal now has optional Two Factor Authentication support using VeriSign’s Identity Protection. First, what is Two Factor Authentication? It is defined as, “something you know (password) and something you HAVE (pin number of sorts).” Here is how it works:

You buy a physical device in the form of a keychain token or a credit card token; or in the cool age of technology, you can simply get one of the free phone apps that do the same thing for you without the extra piece of equipment to carry. Once you get the device/app, you go to the portal, register the token’s unique ID and attach it to a username on the account. The master user gets this FREE, and if you want other users on your account to have this functionality, it is $3 per user per month. If the master user does turn on this functionality, no one else will be allowed into the system without using two factor authentication. Once this is set up, the user will log in using their “known” password and then they will also have to enter the “code” (the thing you have) from the token device or phone app to gain access. The code changes on a fast schedule so this is extremely secure. This would have made the New Year’s celebration for the person above much more fun.
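For readers curious how a code that “changes on a fast schedule” can work, here is a short Python sketch of a generic time-based one-time password in the style of RFC 6238 (built on the HOTP calculation from RFC 4226). This is an assumption for illustration: the post does not say which algorithm the VeriSign tokens use, and the 30-second step, 6-digit length and function names below are simply common defaults, not details of the SoftLayer portal.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using SHA-1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at_time=None, step: int = 30, digits: int = 6) -> str:
    """Time-based code: the counter is the number of `step`-second intervals."""
    if at_time is None:
        at_time = time.time()
    return hotp(secret, int(at_time) // step, digits)


def verify(secret: bytes, submitted: str, at_time=None) -> bool:
    """Check a submitted code against the current interval (no clock-drift window)."""
    return hmac.compare_digest(totp(secret, at_time), submitted)
```

The token and the server share the secret, so both can compute the same code for the current time interval; a stolen password alone is not enough to log in.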

One last thing, since we partnered with VeriSign you can use the token device or phone app for different sites that use the VeriSign product. PayPal is one example. Here is a complete list.

Now that you know about it, and now that we offer it, don’t be the guy that doesn’t keep the portal secure and misses out on a Happy New Year!

June 15, 2009

Help Us Help You

Working the System Admin queue in the middle of the night, I see lots of different kinds of tickets. One thing that has become clear over the months is that a well-formed ticket is a happy ticket and a quickly resolved one. What makes a well-formed ticket? Mostly it is all about information, and attention to these few suggestions can do a great deal toward speeding your ticket to a conclusion.

Category
When you create a ticket you're asked to choose a category for it, such as "Portal Information Question" or "Reboots and Remote Access". Selecting the proper category helps us to triage the tickets. If you're locked out of your server, say due to a firewall configuration, you'd use "Reboots and Remote Access". We have certain guys who are better at CDNLayer tickets, for example, and they will seek out those kinds of tickets, so if you have a CDN question, you'd be best served by using that category. Avoid using Sales and Accounting tickets for technical issues, as those end up first in their respective departments and not in support.

Login Information
This one is a bit controversial. I'm going to state straight out... I get that some people don't want us knowing the login information for the server. My personal server at SoftLayer doesn't have up-to-date login information in the portal. I do this knowing that this could slow things down if I ever had to have one of the guys take a look at it while I'm not at work.

If necessary, we can ask for it in the ticket, but that can cost you time that we could otherwise spend addressing your issue. If you would like us to log into your server for assistance, please provide us with valid login information in the ticket form. Providing up-to-date login credentials will greatly expedite the troubleshooting process and mitigate any potential downtime, but is not a requirement for us to help with issues you may be facing.

Server Identification
If you have multiple servers with us, please make sure to clearly identify the system involved in the issue. If we have a doubt, we're going to stop and ask you, which again can cost you time.

Problem Description
This is really the big one. When typing up the problem description in the ticket please provide as much detail as you can. Each sentence of information about the issue can cut out multiple troubleshooting steps which is going to lead to a faster resolution for you.

Example:

  • Not-so-good: I cannot access my server!
  • Good: I was making adjustments to the Windows 2008 firewall on my server and I denied my home IP of 1.2.3.4 instead of allowing it. Please fix.

Both tickets describe the same symptom. I can guarantee, though, that we're going to have the second customer back into his server quicker, because we have good information about the situation and can go straight to the source of the problem.
