Posts Tagged ‘issues’

October 25, 2012

Tips from the Abuse Department: Save Your Sinking Ship

Posted in Customer Service, SoftLayer, Tips and Tricks

I often find that the easiest way to present a complex process is with a relatable analogy. When esoteric technical details are replaced with a less intimidating real-world illustration, smart people don’t have to be technically savvy to understand what’s going on. When it comes to explaining abuse-related topics, I find analogies especially helpful. One that I’m particularly keen on is explaining Abuse tickets in the context of a sinking ship.

How many times have you received an Abuse ticket and responded to the issue by suspending what appears to be the culprit account? You provide an update in the ticket, letting our team know that you’ve “taken care of the problem,” and you consider it resolved. A few moments later, the ticket is updated on our end, and an abuse administrator is asking follow-up questions: “How did the issue occur?” “What did you do to resolve the issue?” “What steps are being taken to secure the server in order to prevent further abuse?”

Who cares how the issue happened if it’s resolved now, right? Didn’t I respond quickly and address the problem in the ticket? What gives? Well, dear readers, it’s analogy time:

You’re sailing along in a boat filled with important goods, and the craft suddenly begins to take on water. It’s not readily apparent where the water is coming from, but you have a trusty bucket that you fill with the water in the boat and toss over the side. When you toss out all the water onboard, is the problem fixed? Perhaps. Perhaps not.

You don’t see evidence of the problem anymore, but as you continue along your way, your vessel might start riding lower and lower in the water — jeopardizing yourself and your shipment. If you were to search for the cause of the water intake and take steps to patch it, the boat would be in a much better condition to deliver you and your cargo safely to your destination.

In the same way that a hull breach can sink a ship, a security hole on your server can cause problems for your (and your clients’) data. In the last installment of “Tips from the Abuse Department,” Andrew explained some of the extremely common (and often overlooked) ways servers are compromised and used maliciously. As he mentioned in his post, Abuse tickets are often the first notification our customers get that “something’s wrong.”

At a crucial point like this, it’s important to get the water out of the boat AND prevent the vessel from taking on any more water. You won’t be sailing smoothly unless both are done as quickly as possible.

Let’s look at an example of what a thorough response to an Abuse ticket might look like:

A long-time client of yours hosts their small business site on one of your servers. You are notified by Abuse that malware is being distributed from a random folder on their domain. You could suspend the domain and be “done” with the issue, but that long-time client (who’s not in the business of malware distribution) would suffer. You decide to dig deeper.

After temporarily suspending the account to stop any further malware distribution, you log into the server and track down the file and what permissions it has. You look through access logs and discover that the file was uploaded via FTP just yesterday from an IP in another country. With this IP information, you search your logs and find several other instances where suspicious files were uploaded around the same time, and you see that several FTP brute force attempts were made against the server.
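To make that log search a little more concrete, here is a minimal sketch in Python of how you might tally failed logins and uploads per IP. The log format, the event names and the `summarize` and `suspects` helpers are all assumptions made up for illustration. Real FTP daemons each log differently, so the pattern would need to be adapted to whatever your server actually writes.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, simplified log format (an assumption, not any real daemon's):
#   2012-10-24 03:14:07 203.0.113.45 LOGIN_FAIL user=shop
#   2012-10-24 03:15:42 203.0.113.45 UPLOAD user=shop path=/public_html/img/x.php
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<ip>\S+) (?P<event>LOGIN_FAIL|LOGIN_OK|UPLOAD)"
    r" user=(?P<user>\S+)(?: path=(?P<path>\S+))?$"
)


def summarize(lines):
    """Return (failed login counts per IP, uploaded paths per IP)."""
    failures = Counter()
    uploads = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["event"] == "LOGIN_FAIL":
            failures[match["ip"]] += 1
        elif match["event"] == "UPLOAD":
            uploads[match["ip"]].append(match["path"])
    return failures, dict(uploads)


def suspects(failures, uploads, threshold=5):
    """IPs with at least `threshold` failed logins that also uploaded files."""
    return sorted(
        ip for ip, count in failures.items()
        if count >= threshold and ip in uploads
    )
```

An IP that shows up in both lists (many failed logins followed by an upload) is a good place to start digging, though the threshold of five is only a placeholder.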

You know what happened: Someone (or something) scanned the server and attempted to break into the domain. When the server was breached, malware was uploaded to an obscure directory on the domain where the domain owners might not notice it.

With this information in hand, you can take steps to protect your clients and the server itself. The first step might be to implement a password policy that would make guessing passwords very difficult. Next, you might add a rule within your FTP configuration to block continued access after a certain number of failed logins. Finally, you would clean the malicious content from the server, reset the compromised passwords, and unsuspend the now-clean site.
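As a rough illustration of the first two steps, the Python sketch below shows the logic behind a failed-login lockout and a basic password policy check. Every name and number in it (`FailedLoginTracker`, `password_problems`, five failures in ten minutes, a twelve-character minimum) is an assumption chosen for the example, not a recommendation from the Abuse team. In practice you would more likely rely on your FTP daemon's own settings or a dedicated tool rather than hand-rolled code.

```python
import time

MAX_FAILURES = 5      # assumed limit; tune to your environment
WINDOW_SECONDS = 600  # assumed look-back window (10 minutes)


class FailedLoginTracker:
    """Track recent failed logins per IP and report which IPs to block."""

    def __init__(self, max_failures=MAX_FAILURES, window=WINDOW_SECONDS):
        self.max_failures = max_failures
        self.window = window
        self._failures = {}  # ip -> list of failure timestamps

    def _recent(self, ip, now):
        # Keep only the failures that fall inside the look-back window.
        return [t for t in self._failures.get(ip, []) if now - t < self.window]

    def record_failure(self, ip, now=None):
        now = time.time() if now is None else now
        self._failures[ip] = self._recent(ip, now) + [now]

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        return len(self._recent(ip, now)) >= self.max_failures


def password_problems(password, min_length=12):
    """Return a list of reasons the password fails the (assumed) policy."""
    problems = []
    if len(password) < min_length:
        problems.append("shorter than %d characters" % min_length)
    if not any(c.islower() for c in password):
        problems.append("no lowercase letter")
    if not any(c.isupper() for c in password):
        problems.append("no uppercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("no digit")
    return problems
```

The tracker forgets failures once they age out of the window, so a customer who mistypes a password a couple of times is not locked out forever, while a brute force run trips the limit quickly.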

While it’s quite a bit more work than simply identifying and suspending the domain and account responsible for the abuse, the extra time you spend investigating the cause will keep the same issue from recurring. If your client simply “fixes” the problem by deleting the files and directories, they will almost invariably be compromised again in the same way once the domain is restored, and you’ll hear from the Abuse department again.

Server security goes hand in hand with systems administration, and even though it’s not a very fun part of the job, it is a 24/7 responsibility that requires diligence and vigilance. When you invest time and effort into securing your servers and fixing your hull breach rather than just bailing water overboard, your customers will see less downtime, you’ll use your server resources more efficiently, and (best of all) you won’t have the Abuse team hounding you about more issues!

-Garrett

P.S. I came up with a brilliant analogy about DNS and the postal service, so that might be a topic for my next post …

October 10, 2012

On-Call for Dev Support AND a New Baby

Posted in Culture, Funny, SoftLayer

I began working at SoftLayer in May of 2010 as a customer support administrator. When I signed on, I was issued a BlackBerry to help me follow tickets and answer questions from my coworkers when I was out of the office. In August of 2011, that sparingly used BlackBerry started getting a lot more use. I became a systems engineer in development support, where I was tasked with providing first-tier support for development-related escalations, and I joined the on-call rotation.

In the Dev Support group, each systems engineer works a seven-day period each month as the on-call engineer to monitor and respond to off-hours issues. I enjoy tackling challenging problems, and my BlackBerry became an integral tool in keeping me connected and alerting me to new escalations. To give you an idea of what kinds of issues get escalated to development support, let me walk you through one particularly busy on-call night:

I leave the office and get home just in time to receive a call about an escalation. An automated transaction is throwing an error, and I need to check it out. I unload my things, VPN into the SoftLayer network and begin investigating. I find the fix and I get it implemented. I go about my evening, and before I get in bed, I make sure my BlackBerry is set to alert me if a call comes in the middle of the night. Escalations to development support typically slow down after around 11 p.m., but with international presences in Amsterdam and Singapore, it’s always good to be ready for a call at 2:30 a.m. to make sure their issues are resolved with the same speed as issues found in the middle of the day in one of our US facilities.

Little did I know, my SoftLayer experience was actually preparing me for a different kind of “on-call” rotation … One that’s 24x7x365.

In June 2012, my wife and I adopted an infant from El Paso, Texas. We’d been trying to adopt for almost two years, and through lots of patience and persistence, we were finally selected to be the parents of a brand new baby boy. When we brought him home, he woke up every 3 hours for his feeding, and my on-call work experience paid off. I didn’t have a problem waking up when it was my turn to feed him, and once he was fed, I hopped back in bed to get back to sleep. After taking a little time off to spend with the new baby, I returned to my job, and that first week back was also my turn on the on-call rotation.

The first night of that week, I got a 1 a.m. call from Amsterdam to check out a cloud template transfer that was stuck, and I got that resolved quickly. About 30 minutes later, our son cried because he was hungry, so I volunteered to get up and feed him. After 45 minutes, he’d eaten and fallen asleep again, so I went back to bed. An hour later, I got a call from our San Jose data center to investigate a cloud reload transaction that was stalling with an error. I worked that escalation and made it back to bed. An hour and a half later, the little baby was hungry again. My wife graciously took the feeding responsibilities this time, and I tried to get back to sleep after waking up to the baby’s cries. About an hour later, another data center had an issue for me to investigate. At this point, I was red-eyed and very sleepy. When my teammates got up the next morning, they generously took the on-call phone number so I could try to get some rest.

This pattern continued for the next six days. Near the end of that first week, I got a call from work at about 3 a.m., and I picked up the baby monitor from the nightstand and answered, “Dev support, this is Greg.” My wife just laughed at me.

I’ve come to realize that being on-call for a baby is a lot more difficult than being on-call for development support. In dev support, I can usually find documentation on how to resolve a given issue. I can search my email for the same error or behavior, and my coworkers are faithful to document how they resolve any unique issues they come across. If I get to a point where I need help, I can enlist the assistance of an SME/Developer who commonly works on a given piece of code. When you’re on-call with a baby, all the documentation in the world won’t help you get your newborn to stop crying faster, you don’t get any clear “error messages” to guide you to the most effective response, and you can’t pass the baby off to another person if you can’t figure out what’s wrong.

And when you’re on-call for development support, you get some much-needed rest and relaxation after your seven days of work. When you’re on-call for a new baby, you’ve got at least a few months of duty before you’re sleeping through the night.

As I look back at those long nights early on, I laugh and appreciate the important things in my life: my wife, my son, my job and my coworkers.

– Greg