Author Archive: Michael Patrick

December 7, 2009

Availability with NetScaler VPX and Global Load Balancing

The term Single Point of Failure (SPoF) refers to any single component between your clients and your servers whose failure causes downtime. The SPoF can be the server, the network, or the power grid. The dragon Single Point of Failure is always going to be there stalking you; the idea is to push it out as far as your skills and budget allow.

At the server level you can combat SPoF with redundant power supplies and disks. You can also have redundant servers fronted by a load balancer. One benefit of load balancer technology is that traffic for an application is spread across multiple app servers, so you have the ability to take an app server out of rotation for upgrades and maintenance. When you're done you bring the server back online, the load balancer notices it is UP on the next check, and the server is back in service.

Using a NetScaler VPX you can even have two groups of servers: one group that normally answers your queries and another that usually does something else. Through the Backup Virtual Server function, the second group can act as a fallback in case every primary server for a service has to be taken down.
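
For the curious, wiring that up from the NetScaler command line looks roughly like the sketch below. The vserver names and IP addresses are invented for illustration; check the documentation for your firmware's exact syntax.

# names and addresses below are made up for this example
add lb vserver web_vip HTTP 10.10.10.100 80
add lb vserver web_backup_vip HTTP 10.10.10.101 80
set lb vserver web_vip -backupVServer web_backup_vip

With that last line in place, traffic to web_vip falls through to web_backup_vip whenever all of the primary vserver's bound services are down.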

Result: no Single Point of Failure for the app servers.

What happens if you are load balancing and have to take the load balancer out of service for upgrades or maintenance? Right, now we've moved the SPoF up a level. One way to handle this is with the NetScaler VPX product we have at SoftLayer. A pair of VPX instances (NodeA/NodeB) can be teamed in a failover cluster so that if the primary VPX goes down (either by human action or because the hardware failed) the secondary VPX takes over the IPs within a few seconds and begins processing traffic. When you bring NodeA back online it slips into the role of secondary until NodeB fails or is taken down. I will note here that VPX instances do depend on certain shared network resources, and a failure there can take both instances down.
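
Under the hood the pairing is just a couple of commands: each node is told about its peer. A sketch with invented addresses (on the NetScaler CLI the local node is ID 0, so the peer gets a nonzero ID):

# on NodeA (10.10.10.1), point at the peer -- example addresses
add HA node 1 10.10.10.2
# on NodeB (10.10.10.2), point back at NodeA
add HA node 1 10.10.10.1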

Result: Loss of a single VPX is not a Single Point of Failure.

So what's next? A wide-ranging power failure, or a general failure of either the frontend or the backend network, could render both of the NetScalers in a city unusable, or even the entire facility. This can be worked around by having resources in two cities that are able to process queries for your users and by using the Global Load Balancer product we offer. GLB balances load between the cities by way of the DNS answers it returns. A power failure taking down Seattle just means your queries go to Dallas instead. Why not skip the VPX layer and just GLB straight to the app servers? You could, if you don't need the other functionality the VPX provides.

Result: no Single Point of Failure at the datacenter level.

Having redundant functionality between cities takes planning, it takes work, and it takes funding. You have to consider synchronization of content. The web content is easy: run something like an rsync from time to time (see the sketch below). Syncing the database content between machines or across cities is a bit more complicated. I've seen some customers use the built-in replication capabilities of their database software, while others roll a home-grown process such as having their application servers write to multiple database servers. You also have to consider issues of state for your application. Can your application handle bouncing between cities?
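
For the web content case, the whole job can be as small as a cron entry. A sketch, with placeholder paths and hostnames:

# mirror the docroot to the Dallas box; -a preserves ownership and times,
# --delete removes files on the far side that no longer exist locally
rsync -az --delete /var/www/ deploy@dallas-web1.example.com:/var/www/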

Redundancy planning is not always fun, but it is required for serious businesses, even if the conclusion is ultimately not to build any redundancy. People, hardware, and processes will fail. Whether a failure event is a nightmare or just an annoyance depends on your preparation.

October 2, 2009

Is That a Real Computer?

Some mornings after work when the weather is nice I'll go to a local coffee shop on the way home to read or study for the CCNA exams. Sometimes I'll just end up pulling out the netbook and browsing around online. There are times during these outings when I'll get asked the title question of this blog: is that a real computer? I guess it's the size that throws people, but the answer is yes.

For those who are not familiar with the netbook class of systems here are the specs for mine:

  • 10.2 inch screen
  • 1 GB RAM
  • 1.6GHz Intel Atom processor
  • 160GB SATA hard drive
  • 3 USB ports
  • Card reader
  • Built-in Wifi
  • Built-in webcam
  • Windows XP (I've got plans for Windows 7)
  • 5 hour battery life
  • Light weight (I've got books that weigh more)

Netbooks are great for when you're just knocking around town and might want to do some light web work. This morning while at Starbucks I've checked e-mail several times, caught up on the daily news, and reviewed the game statistics from the Cowboys game I missed last night. Other mornings I've fired up a VPN connection into the office and been able to remotely help with tickets, work on documentation for our SSL product and tinker around with a NetScaler VPX Express virtual machine (an interesting bit of tech for a later article).

So how does this tie into server hosting?

You've probably had a time when your monitoring indicated that a service on a server had stopped responding. If all you have is a cell phone, the options are somewhat limited. With a fancy enough phone you might have an SSH or RDP client, but do you really want to do anything on a PDA-sized screen? I didn't think so. You can put in a ticket from your phone and our support can help out, but the person best able to fix a service failure is still going to be you, the server administrator who knows where all the bodies are buried and how the bits tie together.

A small netbook can be a lightweight (and inexpensive) administration terminal for your servers hosted with us. Just find an Internet connection, connect to the SoftLayer VPN, and you have complete access to work on your servers over a secure connection.

Through the wonders of the IPMI KVM this access even includes the console, which opens up the possibility of safely doing a custom kernel build and install while sitting under the stars, drinking a hot chocolate and watching the local nightlife.

Sounds like a pretty nice reality to me.

September 2, 2009

SSL Comes to SoftLayer

Those who keep a close eye on the menu options in the customer management portal will have noticed a recent addition under Security where you can now order SSL certificates. For those not familiar with SSL, a certificate is used by an application to establish identity and provide encryption services. Naturally, you do not have to order your SSL certificates through us. Certificates ordered elsewhere will work just fine on your server here, and certificates ordered here will work fine elsewhere.

So why order your SSL through SoftLayer? To me, it's a convenience and security thing. Ordering with us is convenient because you can place and manage the order via the portal, just like you manage other aspects of your account already. Management includes being able to see when your certificates are going to expire and the ability to renew them. If the certificate file itself is deleted by accident, you can get a copy of it e-mailed to you via the portal. From a security point of view, you already have a billing arrangement with us, so why give your credit card information to another party?

I can see someone thinking, "But is that safe... what if I leave SoftLayer?" Yes, it is safe. The only information you have to provide to us when ordering is the Certificate Signing Request (CSR) and some billing verification, both of which you would provide to any SSL vendor. The private key, which is the core of SSL security, is not kept or handled by SoftLayer. The private key is generated by your administration staff and remains on your server.
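
If you have never generated a CSR, on a Unix server with OpenSSL it is a one-liner. A sketch; the filenames are whatever you like:

# creates a new 2048-bit private key (which stays on your server)
# and the CSR you paste into the order form
openssl req -new -newkey rsa:2048 -nodes -keyout www_example_com.key -out www_example_com.csr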

Let us chat about the private key for a moment. The private key is meant to be known only by the server applications to which it is assigned. If it is lost, corrupted, or deleted, you will need a new certificate. What this all means is that you should only allow people you really trust access to the private key, and above all you must keep a good, safe backup of the file. SoftLayer support can perform quite a bit of server voodoo, but recreating a lost private key isn't an option.

I'd invite anyone with a bit of time to experiment with the SSL functionality we offer. You might find something useful for your business.

August 26, 2009

Cool Tool: nslookup

If you've been around the Internet awhile you've probably heard of the Domain Name System. DNS is what takes www.domain.com and turns it into the 1.2.3.4 IP address which your application actually uses to find the server hosting www.domain.com.

Fascinating, Michael, but why do I care? Well, if you ask that question, you've never had DNS fail on you.

When name resolution goes on the blink, one of the tools that support uses to see what is going on is the command-line utility nslookup. In its most basic form, nslookup does an A record query for the string you supply as an argument and sends that query to your operating system's configured resolvers.


C:\>nslookup www.softlayer.com
Server: mydns.local
Address: 192.168.0.1

Non-authoritative answer:
Name: www.softlayer.com
Address: 66.228.118.51


What is the utility telling us? First off, it asked the resolver at 192.168.0.1 for the information. "Non-authoritative answer" means that the server which returned the answer (192.168.0.1) is not the nameserver which controls softlayer.com. It then gives the IP address or addresses which were found.



C:\>nslookup -q=mx softlayer.com ns1.softlayer.com
Server: ns1.softlayer.com
Address: 67.228.254.4

softlayer.com MX preference = 20, mail exchanger = mx02.softlayer.com
softlayer.com MX preference = 30, mail exchanger = mx03.softlayer.com
softlayer.com MX preference = 10, mail exchanger = mx01.softlayer.com
softlayer.com nameserver = ns2.softlayer.net
softlayer.com nameserver = ns1.softlayer.net


This is a slightly different query. Rather than asking my local resolver to do an A record query for www.softlayer.com, I've sent an MX (mail exchanger) query for softlayer.com directly to the nameserver ns1.softlayer.com. Notice that the response does not carry the non-authoritative tag. The server ns1.softlayer.com is one of the nameservers configured to give a definitive answer to the question rather than just saying "well, this other guy said...".

One thing that both of these queries fail to show is the TTL for the answer they give. Time To Live (TTL) is what generally controls how long a resolver will keep an answer in cache. While the TTL is valid the resolver will use the cached answer; once the TTL expires, the resolver goes looking for a fresh one. This is great for performance, but it has a dark side: because of TTL, changes to DNS records are not seen instantly by all clients. If ClientA hits your website often, his resolver is going to have the query result cached (say www.domain.com -> 1.2.3.4). You change the record to www.domain.com -> 5.6.7.8, but ClientA's resolver is going to continue to respond with 1.2.3.4 until the TTL runs out. If ClientA controls their resolver they can flush its cache; generally, though, it is controlled by their ISP and you just have to wait.

To see the TTL for an answer you can use the nslookup form below:



C:\>nslookup
Default Server: mydns.local
Address: 192.168.5.1

> set debug
> www.softlayer.com.
Server: mydns.local
Address: 192.168.5.1


------------
Got answer:
    HEADER:
        opcode = QUERY, id = 2, rcode = NOERROR
        header flags: response, want recursion, recursion avail.
        questions = 1, answers = 1, authority records = 2, additional = 0

    QUESTIONS:
        www.softlayer.com, type = A, class = IN
    ANSWERS:
    -> www.softlayer.com
        internet address = 66.228.118.51
        ttl = 86400 (1 day)
    AUTHORITY RECORDS:
    -> softlayer.com
        nameserver = ns1.softlayer.net
        ttl = 86400 (1 day)
    -> softlayer.com
        nameserver = ns2.softlayer.net
        ttl = 86400 (1 day)
------------
Non-authoritative answer:
Name: www.softlayer.com
Address: 66.228.118.51


The key to this spew is 'set debug', which causes nslookup to display additional information about the response, including the TTL value of the answer. You'll notice that the TTL in the ANSWERS section is 86400 seconds, which is the number of seconds in one day and a common TTL value. If I run the query again, though, I get the following ANSWERS section:



ANSWERS:
-> www.softlayer.com
    internet address = 66.228.118.51
    ttl = 85802 (23 hours 50 mins 2 secs)

Notice how the TTL is counting down. The resolver will continue responding with the answer 66.228.118.51 until that TTL hits zero; at zero, it goes looking for a new answer. What this means for you as a domain operator is that if you know you're going to be changing a record, you should turn down the TTL for that record a couple of days in advance. For example, when some friends and I moved our colo server from one provider to another, we dropped the TTLs for our DNS records to 30 minutes two days prior to the move. Once the move was complete we put them back to their prior values.

If you spend any time at all messing with DNS, you should play around with nslookup.

If you're on a Unix system, take a look at the command 'dig' as well.
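
As a quick taste of dig, it shows the TTL by default (second column of the answer), no debug mode required:

$ dig +noall +answer www.softlayer.com
www.softlayer.com.      86400   IN      A       66.228.118.51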

Happy resolving.

August 18, 2009

Backups Are Not the Whole Story

Last night, while making my regular backup of my World of Warcraft configuration, I thought about the blog and realized I didn't remember seeing an article that went into any more detail about backing up and restoring data than "backups are good."

If you've been around the InnerLayer for a while you will have noticed that backing up data comes up periodically.  This happens because we frequently see customers whose world is turned upside down when a mistyped command wipes out their data.  If you just thought "that won't happen to me... I'm careful at a prompt"... well, how about a cracker getting in via an IIS zero-day exploit?  A kernel bug corrupting the filesystem?  A hard drive failure?  Data loss will happen to you, whatever the cause.

Data that is not backed up is data the server administrator does not view as important.  As the title of this post says, though, backing up isn't the end of the server administrator's responsibility.  Consider the following points.

  • Is the backup in a safe location?  Backing up to the same hard drive which houses the live data is not a good practice.
  • Is the backup valid?  Did the commands to create it all run properly?  Did they get all the information you need?  Do you have enough copies?
  • Can your backup restore a single file or directory?  Do you know how to restore it?  Simply put, a restore is getting data from a backup back into a working state on a system.

Backup Safety
At a minimum, backups should be stored on a separate hard drive from the data the backup is protecting.  Better is a local copy of the backup on the machine in use plus a copy off the machine: in eVault, on a NAS which is _NOT_ always mounted, or even on another server.  Why?  The local backup gives you quick access to the content, while the off-machine copies give you the safety that if one of your employees does a secure wipe on the machine in question you haven't lost both the data and the backup.

Validity
A backup is valid if it gets all the data you need to bring your workload back online in the event of a failure.  This could be web pages, database data, config files (frequently forgotten) and notes on how things work together.  Information systems get complicated, and if you've got a Notepad file somewhere listing how Tab A goes into Slot B, that should be in your backups.  Yes, you know how it works... great; you get hit by a bus, does your co-admin know how that system is put together?  Don't forget dependencies.  A forum website is pretty worthless if it is backed up but the database to which it looks is not.  For me, another mark of a valid backup is one which has some history.  Do not back up today and delete yesterday's.  Leave a week or more of backups available (see the sketch below); people don't always notice immediately that something has broken.
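
To illustrate the history point, a dated archive plus a cleanup pass gives you a rolling week of backups. This is only a sketch: the paths are placeholders and your file list will differ.

#!/bin/sh
# nightly config backup with roughly a week of history (example paths)
STAMP=`date +%Y%m%d`
tar czf /backup/configbackup-$STAMP.tar.gz /etc /var/www /root/notes
# prune archives more than 7 days old
find /backup -name 'configbackup-*.tar.gz' -mtime +7 -exec rm -f {} \;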

Restores
A good way to test a restore is to get a second server for a month, configured the same as your primary, then take the backup from the primary and restore it onto the secondary.  See what happens.  Maybe it will go great.  More likely you will run into issues.  Did you forget about a small operating system tweak made some morning at 4am?  How about time: how long does it take to go from a clean OS install to a working system?  If this time is too long, you might have too much going on one server and need to split your workload among a few servers.  As with everything else in maintaining a server, practicing your restores is not a one-time thing.  Schedule yourself a couple of days once a quarter to do a disaster simulation.

For those who might be looking at this and saying "that is a lot of work": yes, it is.  It is part of running a server.  I do this myself on a regular basis for a small server hosting e-mail and web data for some friends.  I have a local "configbackup" directory on the server which holds the mail configs, the server configs, the nameserver configs and the database data.  In my case, I've told my users straight up that their home directories are their own responsibility.  Maybe you can do that, maybe not.  Weekly, that configbackup data is copied to a file server here at my apartment.  The file server itself is backed up periodically to a USB drive which is kept at a friend's house.

July 27, 2009

Cool Tool: find

Have you ever gotten an e-mail from your server that a particular partition is filling up? Unfortunately, the e-mails don't usually tell you where the big files are hiding.

You can determine this and many other handy things using the Unix utility 'find'. I use the 'find' command all the time, both in my work at SoftLayer and in running some sites I manage outside of work. Being able to find the files owned by a particular user, for example, can be handy (more on that below).

The 'find' command takes as arguments various tests to run on the files and directories it scans. Running 'find' with no arguments simply lists the files and directories under your current location. The real power comes from combining the different switches.

find /some/path -name "myfile*" -perm 700

This format of the command will search for items within /some/path that have names starting with the string 'myfile' and also have the permission value of 700 (rwx------).

find /some/path -type f -size +50M

Find files that are larger than 50MB. The '-type f' argument tells find to only look for files.

find /some/path -type f -size +50M -ctime -7

Find files that are larger than 50MB and whose status (ctime) has changed within the last seven days. Note that despite the common shorthand, ctime is the inode change time, not strictly the creation time.

find /some/path -type f -size +50M -ctime -7 -exec ls -l {} \;

The -exec tells find to run some command against each match it finds; in this case, 'ls -l'. Moves, removes, and even full custom scripts are doable as well.
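
Tying back to the earlier mention of finding files owned by a particular user, -user combines with -exec in the same way (the username here is made up):

find /home -user fred -type f -exec ls -l {} \;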

There are many, many more arguments possible for 'find'. Refer to the man page for find on your particular flavor of Unix to see all the options. As with all shell commands, know what you are running: given the chance, 'find' will wipe out anything it can (via -exec rm {} \;, for example).

July 8, 2009

Encrypted Hot Chocolate

Imagine this scene: you’re sitting at a local coffee shop, having a drink and browsing the web. While checking out your favorite news site you see an e-mail come in where someone is commenting on your blog post from that morning. This is odd because while you remember checking blogs, you don’t remember posting one. On investigating you find a blog post that you definitely did not create. As you look around wondering what is going on you should probably take a peek at the guy sitting in the comfy chair with his headphones on running a wireless sniffer to grab passwords out of the air.

How does this happen? The coffee shops I’ve seen all run open wireless access points. This is great for flexibility and for serving the most people, but if the access point is unencrypted it is quite possible to run an application that listens to the wireless network and records packets. These packets can then be examined with a tool like Wireshark, and since they are not encrypted, things like username/password combos can be read out in clear text.

Knowing what the problem is, how do we work around it? Since we cannot encrypt the wireless session, we’ll encrypt the data itself. One option would be a VPN from your laptop in the coffee shop to a location out in the world, which would channel all of your traffic through that other system. It’s a good solution, but it can be a bit technically complicated if you don’t already have one set up. If you’re really only concerned with encrypting your HTTP traffic, you can use the PuTTY application to tunnel traffic over an encrypted session to a Unix server here at SoftLayer, using OpenSSH’s ability to act as a SOCKS5 proxy.
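
For what it’s worth, PuTTY is just putting a friendly face on a standard OpenSSH feature. From a Unix or Mac laptop the same tunnel is one command; the hostname below is a placeholder:

# -D 8080 opens a local SOCKS listener on port 8080; -N skips running a remote shell
ssh -D 8080 -N myuser@myserver.example.com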

When you define your connection in PuTTY, go down to Connection > SSH > Tunnels and place a port number, such as 8080, in the Source field. Select “Dynamic” and “Auto”, then click the Add button. Connect to your Unix server here at SoftLayer. Next stop is the browser. The way I configure my browser for this is to go into Firefox > Tools > Options > Advanced > Network > Settings (under Connection). Select “Manual proxy configuration”, put “127.0.0.1” in the SOCKS Host field, and for the port use the port you specified above. Leave the type as “SOCKS v5” and select OK. Then, in the URL bar, type “about:config”, which lets you do advanced configuration. In the filter field type “network.proxy.socks_remote_dns”, right-click on it, and select Toggle. This will mark it true.

Now if you pull up a website which will tell you the IP that you are coming from (such as http://whatismyip.com) you should see it report back the IP address of your server here at SoftLayer. This happens because Firefox has been told to use 127.0.0.1:8080 as its SOCKSv5 proxy and this traffic gets tunneled via the encrypted SSH session to your server at SoftLayer. The server here will do the DNS lookup (due to network.proxy.socks_remote_dns) and then send out the request. The response will be tunneled back to your browser.

You do have to remember to fire up the PuTTY session first, but that isn’t hard to remember: if you try to browse without it, the browser fails trying to reach the specified SOCKSv5 proxy port. Beyond that, I’ve not run into any troubles using this trick.

And now I think I will head off for some hot chocolate myself.

June 29, 2009

Leaving Normal

What is normal for a server? In support we get that question from time to time. The problem is that normal varies from server to server. A load average of 200 is probably not normal, but a load of 5 to 10 very well could be, depending on the server's application. What to do?

Baselining to the rescue. The idea behind baselining is to get performance numbers on your application when things are "normal" so that you have solid math to indicate when things are not "normal".

What makes a good baseline? Things like RAM use (overall, per process, rate of change), number and types of processes running, processor usage, disk usage (total, per app), disk speed and network utilization are all good OS metrics. You can also get metrics from your application. E-mails per hour, web page generation time, and number of users logged in are good to know.

You can capture OS metrics using tools like top, free, ps and iostat on Linux. In fact, if you have iostat you probably have 'sar', which is great for performance history. Sar has a process that runs every few minutes and records various OS counters, including processor info, RAM use, disk I/O and the like.
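
To give a flavor of sar, the same binary both samples live counters and reads back the recorded history (the history path below is a common default; it varies by distribution):

# CPU utilization, one sample every 5 seconds, 12 samples
sar -u 5 12
# memory and swap usage from today's history file
sar -r
# run queue / load average history for the 15th of the month
sar -q -f /var/log/sa/sa15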

For the Windows people, you have Task Manager and Performance Monitor. Task Manager is pretty simple and gives mostly an overview. PerfMon is really where it's at on Windows: you can track dozens of performance counters on disk, processor, memory, and the network, and even application-specific metrics if you are running apps like MS Exchange that support them.

As with most tasks related to being the lord and master of a server, performance monitoring isn't a one-time thing. As you make changes to the system you have to run new baselines, and between changes you should run your performance routines periodically to see how things are trending. It is much easier to look into an issue if you spot it early rather than late.

Go forth and make sure all your baselines are belong to you!

*bonus cool points for those who knew the title of this blog was also the title of a "Roswell" episode.

June 19, 2009

Self Signed SSL

A customer called up concerned the other day after getting a dire-looking warning in Firefox 3 regarding a self-signed SSL certificate.

"The certificate is not trusted because it is self signed."

In that case, she was connecting to her Plesk Control Panel and she wondered if it was safe. I figured the explanation might make for a worthwhile blog entry, so here goes.

When you connect to an HTTPS website, your browser and the server exchange certificate information which allows them to encrypt the communication session. A certificate can be signed in one of two ways: by a certificate authority (CA), or by itself, in which case it is known as self-signed. Either is just as good from an encryption point of view: keys are exchanged and data gets encrypted.

So if they are equally good from an encryption point of view why would someone pay for a CA signed certificate? The answer to that comes from the second function of an SSL cert: identity.

A CA-signed cert is considered superior because someone (the CA) has said, "Yes, the people to whom we've sold this cert have convinced us they are who they say they are." This convincing is sometimes little more than presenting some money to the CA. What makes the browser trust a given CA? That would be its configured store of trusted root certificates. For example, in Firefox 3, if you go to Options > Advanced > Encryption and select View Certificates, you can see the pre-installed trusted certificates under the Authorities tab. Provided a certificate has a chain of signatures leading back to one of these Authorities, Firefox will accept that it is legitimately signed.

To make the browser completely happy a certificate has to pass the following tests:

1) Valid signature
2) The Common Name needs to match the hostname you're trying to hit
3) The certificate has to be within its valid time period

A self-signed cert can pass all of those tests, provided you configure your browser to accept it as an Authority certificate.
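
Incidentally, creating a self-signed cert of your own is a single OpenSSL command. A sketch; the filenames and the one-year lifetime are arbitrary choices:

# generates a new 2048-bit key and a certificate signed with that same key
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout selfsigned.key -out selfsigned.crt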

Back to the original question: is it safe to work with a certificate which your browser has flagged as problematic? The answer is yes, if the problem is expected, such as hitting the self-signed cert on a new Plesk installation. Where you should be concerned is when a certificate that SHOULD be good, such as your bank's, causes the browser to complain. In that case further investigation is definitely warranted. It could be just a glitch or a misconfiguration; it could also be someone trying to impersonate the target site.

Until next time... go forth and encrypt everything!

June 15, 2009

Help Us Help You

Working the System Admin queue in the middle of the night, I see lots of different kinds of tickets. One thing that has become clear over the months is that a well-formed ticket is a happy ticket and a quickly resolved one. What makes a well-formed ticket? Mostly it is about information, and attention to the few suggestions below can do a great deal to speed your ticket toward a conclusion.

Category
When you create a ticket you're asked to choose a category for it, such as "Portal Information Question" or "Reboots and Remote Access". Selecting the proper category helps us triage the tickets. If you're locked out of your server, say due to a firewall configuration, you'd use "Reboots and Remote Access". We have certain guys who are better at CDNLayer tickets, for example, and they will seek out that kind, so if you have a CDN question you'd be best served by using that category. Avoid using Sales and Accounting tickets for technical issues, as those land first in their respective departments rather than in support.

Login Information
This one is a bit controversial. I'm going to state straight out: I get that some people don't want us knowing the login information for their server. My personal server at SoftLayer doesn't have up-to-date login information in the portal, and I do this knowing it could slow things down if I ever had to have one of the guys take a look at it while I'm not at work.

If necessary, we can ask for it in the ticket, but that costs time we could otherwise spend addressing your issue. If you would like us to log into your server for assistance, please provide valid login information in the ticket form. Up-to-date credentials will greatly expedite the troubleshooting process and mitigate potential downtime, but they are not a requirement for us to help with issues you may be facing.

Server Identification
If you have multiple servers with us, please make sure to clearly identify the system involved in the issue. If we have a doubt, we're going to stop and ask you, which again can cost you time.

Problem Description
This is really the big one. When typing up the problem description, please provide as much detail as you can. Each sentence of information about the issue can cut out multiple troubleshooting steps, which leads to a faster resolution for you.

Example:

  • Not-so-good: I cannot access my server!
  • Good: I was making adjustments to the Windows 2008 firewall on my server and I denied my home IP of 1.2.3.4 instead of allowing it. Please fix.

Both tickets describe the same symptom. I can guarantee, though, that we're going to have the second customer back into his server quicker, because we have good information about the situation and can go straight to the source of the problem.
