Posts Tagged 'Monitoring'

September 16, 2013

Sysadmin Tips and Tricks - Using strace to Monitor System Calls

Linux admins often encounter rogue processes that die without explanation, go haywire without leaving any meaningful log data, or fail in other interesting ways without providing information that helps troubleshoot the problem. Have you ever wished you could just see what the program is trying to do behind the scenes? Well, you can — strace (system trace) is very often the answer. It is included in most distros' package managers, and the syntax should be pretty much identical on any Linux platform.

First, let's get rid of a misconception: strace is not a "debugger," and it isn't a programmer's tool. It's a system administrator's tool for monitoring system calls and signals. It doesn't involve any sophisticated configuration, and you don't have to learn any new commands ... In fact, the most common system calls you'll see in strace output have names you already learned early on:

  • read
  • write
  • open
  • close
  • stat
  • fork
  • execute (execve)
  • chmod
  • chown

 

You simply "attach" strace to the process, and it will display all the system calls and signals resulting from that process. Instead of executing the command's built-in logic, strace just makes the process's normal calls to the system and returns the results of the command with any errors it encountered. And that's where the magic lies.

Let's look at an example to show that behavior in action. First, become root — you'll need root privileges later when we trace system services and attach to running processes. Second, make a simple text file called 'test.txt' with these two lines in it:

# cat test.txt
Hi I'm a text file
there are only these two lines in me.

Now, let's run that cat command again, this time under strace:

$ strace cat test.txt 
execve("/bin/cat", ["cat", "test.txt"], [/* 22 vars */]) = 0
brk(0)  = 0x9b7b000
uname({sys="Linux", node="ip-208-109-127-49.ip.secureserver.net", ...}) = 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=30671, ...}) = 0
mmap2(NULL, 30671, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f35000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000_\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1594552, ...}) = 0
mmap2(NULL, 1320356, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7df2000
mmap2(0xb7f2f000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13c) = 0xb7f2f000
mmap2(0xb7f32000, 9636, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7f32000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7df1000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7df0000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7df1b80, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7f2f000, 8192, PROT_READ) = 0
mprotect(0xb7f57000, 4096, PROT_READ) = 0
munmap(0xb7f35000, 30671) = 0
brk(0)  = 0x9b7b000
brk(0x9b9c000) = 0x9b9c000
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
open("test.txt", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=57, ...}) = 0
read(3, "Hi I'm a text file\nthere are onl"..., 4096) = 57
write(1, "Hi I'm a text file\nthere are onl"..., 57Hi I’m a text file
there are only these two lines in me.
) = 57
read(3, "", 4096) = 0
close(3) = 0
close(1) = 0
exit_group(0) = ?

Now that output may look really arcane, but if you study it a little, you'll see that it includes lots of information even an ordinary admin can easily understand. The first line is the execve system call that executes /bin/cat with test.txt as its argument. After that, you'll see the cat binary open some system libraries, along with the brk and mmap2 calls that allocate memory. That stuff isn't usually particularly useful in the context we're working in here, but it's important to understand what's going on. What we're most interested in are often the open calls:

open("test.txt", O_RDONLY|O_LARGEFILE) = 3

It looks like when we run cat test.txt, it opens "test.txt", doesn't it? In this case that's hardly surprising, but imagine a situation where you don't know what files a given program is trying to open ... strace immediately makes life easier. In this particular example, you'll also notice the "= 3" at the end: that's the file descriptor the kernel assigned to the newly opened file. If you see a "read" call with 3 as the first parameter after this point, you know it's reading from that file:

read(3, "Hi I'm a text file\nthere are onl"..., 4096) = 57

Pretty interesting, huh? By default, strace only shows the first 32 or so characters of the data in a read, but the return value tells us that all 57 bytes in the file (newlines included) were read. After the text is read into memory, we see cat write it to standard output (file descriptor 1), which is why the actual contents of the text file appear interleaved with the trace. That's a relatively simple example, but it helps us understand what's going on behind the scenes.
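If you want to see more of the data in those read and write calls, strace's -s option controls how many characters of each string it prints before truncating (32 by default). Something like this, for example, would show the file's full contents in the read line:

# strace -s 256 cat test.txt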

Real World Example: Finding Log Files

Let's look at a real world example where we'll use strace for a specific purpose: you can't figure out where your Apache logs are being written, and you're too lazy to read the config file (or perhaps you can't find it). Wouldn't it be nice to follow everything Apache does when it starts up, including opening all of its log files? Well, you can:

strace -Ff -o output.txt -e open /etc/init.d/httpd restart

We're executing strace and telling it to follow all forks (-Ff), write its output to a file (-o output.txt), and only report 'open' system calls to keep some of the chaff out of the output (-e open), and then we hand it the command to run: '/etc/init.d/httpd restart'. This creates a file called "output.txt" that we can search for references to our log files:

# cat output.txt | grep log
[pid 13595] open("/etc/httpd/modules/mod_log_config.so", O_RDONLY) = 4
[pid 13595] open("/etc/httpd/modules/mod_logio.so", O_RDONLY) = 4
[pid 13595] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
[pid 13595] open("/etc/httpd/logs/ssl_error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 11
[pid 13595] open("/etc/httpd/logs/access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 12
[pid 13595] open("/etc/httpd/logs/cm4msaa7.com", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 13
[pid 13595] open("/etc/httpd/logs/ssl_access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 14
[pid 13595] open("/etc/httpd/logs/ssl_request_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 15
[pid 13595] open("/etc/httpd/modules/mod_log_config.so", O_RDONLY) = 9
[pid 13595] open("/etc/httpd/modules/mod_logio.so", O_RDONLY) = 9
[pid 13596] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
[pid 13596] open("/etc/httpd/logs/ssl_error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 11
open("/etc/httpd/logs/access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 12
open("/etc/httpd/logs/cm4msaa7.com", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 13
open("/etc/httpd/logs/ssl_access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 14
open("/etc/httpd/logs/ssl_request_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 15

The log files jump out at you, don't they? Because we know Apache will want to open its log files when it starts, all we have to do is follow the system calls it makes during startup, and we'll find all of those files. Easy, right?

Real World Example: Locating Errors and Failures

Another valuable use of strace involves looking for errors. If a program fails when it makes a system call, you'll want to be able to pinpoint any errors that might have caused that failure as you troubleshoot. Whenever a system call fails, strace returns a line with "= -1" in the output, followed by an explanation. Note: the space before -1 is important, and you'll see why in a moment.

For this example, let's say Apache isn't starting for some reason, and the logs aren't telling us anything about why. Let's run strace:

strace -Ff -o output.txt -e open /etc/init.d/httpd start

Apache will attempt to start, and when it fails, we can grep output.txt for '= -1' to see any system calls that failed:

$ cat output.txt | grep '= -1'
[pid 13748] open("/etc/selinux/config", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/i686/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/i686/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/i686/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/i686/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libnsl.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libutil.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/etc/gai.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EACCES (Permission denied)

With experience, you'll come to understand which errors matter and which ones don't. Most often, the last error is the most significant. The first several lines here just show the program probing for libraries that may or may not be present, so they don't really matter in our pursuit of what's going wrong with the Apache start. We scan down and find the last line:

[pid 13748] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EACCES (Permission denied)

After a little investigation of that file, I see that some maniac has set the immutable attribute:

# lsattr /etc/httpd/logs/error_log
----i-------- /etc/httpd/logs/error_log

Our error couldn't be found in the log file because Apache couldn't even open it! You can imagine how long it might take to figure out this particular problem without strace; with this useful tool, the cause can be found in minutes.
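Assuming that immutable bit really was set by mistake, the fix is a one-liner: clear the attribute with chattr and start Apache again:

# chattr -i /etc/httpd/logs/error_log
# /etc/init.d/httpd start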

Go and Try It!

All major Linux distros have strace available — just type strace at the command line for the basic usage. If the command is not found, install it via your distribution's package manager. Get in there and try it yourself!

For a fun first exercise, bring up a text editor in one terminal, then strace the editor process from another terminal with the -p flag (strace -p <process_id>), since this time we want to look at an already-running process. When you go back and type in the text editor, the system calls will show up in strace as you type ... you're seeing what happens in real time!
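As a rough sketch (the editor and PID below are placeholders; use whatever pgrep reports on your system):

# pgrep -n vim
4242
# strace -p 4242

When you're done watching, Ctrl-C detaches strace and leaves the editor running.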

-Lee

November 14, 2012

Risk Management: Securing Your Servers

How do you secure your home when you leave? If you're like most people, you make sure to lock the door you leave from, and you head off to your destination. If Phil is right about "locks keeping honest people honest," simply locking your front door may not be enough. When my family moved into a new house recently, we evaluated its physical security and tried to determine possible avenues of attack (garage, doors, windows, etc.), tools that could be used (a stolen key, a brick, a crowbar, etc.) and ways to mitigate the risk of each kind of attack ... We were effectively creating a risk management plan.

Every risk has different probabilities of occurrence, potential damages, and prevention costs, and the risk management process helps us balance the costs and benefits of various security methods. When it comes to securing a home, the most effective protection comes from using layers of different methods ... To prevent a home invasion, you might lock your door, train your dog to make intruders into chew toys and have an alarm system installed. Even if an attacker can get a key to the house and bring some leftover steaks to appease the dog, the motion detectors for the alarm are going to have the police on their way quickly. (Or you could violate every HOA regulation known to man by digging a moat around the house, filling it with sharks with laser beams attached to their heads, and building a medieval drawbridge over the moat.)

I use the example of securing a house because it's usually a little more accessible than talking about "server security." Server security doesn't have to be overly complex or difficult to implement, but its stigma of complexity usually prevents systems administrators from incorporating even the simplest of security measures. Let's take a look at the easiest steps to begin securing your servers in the context of their home security parallels, and you'll see what I'm talking about.

Keep "Bad People" Out: Have secure password requirements.

Passwords are your keys and your locks — the controls you put into place that ensure that only the people who should have access get it. There's no "catch all" method of keeping the bad people out of your systems, but employing a variety of authentication and identification measures can greatly enhance the security of your systems. A first line of defense for server security would be to set password complexity and minimum/maximum password age requirements.
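On a typical Linux server, for example, password aging can be set per user with the standard chage utility (the numbers and username here are only examples; pick values that match your policy):

# chage -m 7 -M 90 -W 14 alice

Distro-wide defaults for new accounts usually live in /etc/login.defs (PASS_MIN_DAYS, PASS_MAX_DAYS, PASS_WARN_AGE).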

If you want to add an additional layer of security at the authentication level, you can incorporate "Strong" or "Two-Factor" authentication. From there, you can learn about a dizzying array of authentication protocols (like TACACS+ and RADIUS) to centralize access control or you can use active directory groups to simplify the process of granting and/or restricting access to your systems. Each layer of authentication security has benefits and drawbacks, and most often, you'll want to weigh the security risk against your need for ease-of-use and availability as you plan your implementation.

Stay Current on your "Good People": When authorized users leave, make sure their access to your system leaves with them.

Say you gave your neighbor a key to your tool shed while he was finishing his renovation, and he never returns the tools he borrows. When you tell him he can't borrow any more, you need to take the key back, too. If you don't, nothing is stopping him from walking over to the shed when you're not looking and taking more (all?) of your tools. It seems like a silly example, but that kind of oversight is a big one when it comes to server security.

Employees are granted access to perform their duties (the principle of least privilege), and when they no longer require access, the "keys to the castle" should be revoked. Auditing who has access to what (whether it be for your systems or for your applications) should be continual.

You might have processes in place to grant and remove access, but it's also important to audit those privileges regularly to catch any breakdowns or oversights. The last thing you want is to have a disgruntled former employee wreak all sorts of havoc on your key systems, sell proprietary information or otherwise cost you revenue, fines, recovery efforts or lost reputation.
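A quick, low-tech sketch of that kind of audit on a Linux box: list the human accounts and check when each last logged in (the UID cutoff of 1000 is an assumption; some older distros start regular users at 500):

# awk -F: '$3 >= 1000 {print $1}' /etc/passwd
# lastlog

Any account that hasn't logged in since its owner left the company deserves a closer look.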

Catch Attackers: Monitor your systems closely and set up alerts if an intrusion is detected.

There is always a chance that bad people are going to keep looking for a way to get into your house. Maybe they'll walk around the house to try and open the doors and windows you don't use very often. Maybe they'll ring the doorbell and if no lights turn on, they'll break a window and get in that way.

You can never completely eliminate all risk. Security is a continual process, and eventually some determined, over-caffeinated hacker is going to find a way in. Thinking your security is impenetrable makes you vulnerable if, by some stretch of the imagination, an attacker breaches your security (see: Trojan Horse). Continuous monitoring strategies can alert administrators if someone does things they shouldn't be doing. Think of it as a motion detector in your house ... "If someone gets in, I want to know where they are." When you implement monitoring, logging and alerting, you will also be able to recover more quickly from security breaches because every file accessed will be documented.
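On Linux, the audit subsystem is one simple way to get that motion detector. A watch rule like the one below logs every read, write or attribute change on a file you care about (the path and key name are just examples, and auditd has to be installed and running):

# auditctl -w /etc/passwd -p rwa -k passwd-watch
# ausearch -k passwd-watch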

Minimize the Damage: Lock down your system if it is breached.

A burglar smashes through your living room window, runs directly to your DVD collection, and takes your limited edition "Saved by the Bell" series box set. What can you do to prevent them from running back into the house to get the autographed poster of Alf off of your wall?

When you're monitoring your servers and you get alerted to malicious activity, you're already late to the game ... The damage has already started, and you need to minimize it. In a home security environment, that might involve an ear-piercing alarm or filling the moat around your house even higher so the sharks get a better angle to aim their laser beams. File integrity monitors and IDS software can mitigate damage in a security breach by reverting files when checksums don't match or stopping malicious behavior in its tracks.
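Even without dedicated IDS software, you can sketch the file-integrity idea with standard tools: record a checksum baseline of critical files, store it somewhere the server itself can't modify, and verify against it later (the paths below are just examples):

# sha256sum /etc/passwd /etc/shadow /etc/ssh/sshd_config > baseline.sha256
# sha256sum -c baseline.sha256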

These recommendations are only a few of the first-line layers of defense when it comes to server security. Even if you're only able to incorporate one or two of these tips into your environment, you should. When you look at server security in terms of a journey rather than a destination, you can celebrate the progress you make and look forward to the next steps down the road.

Now if you'll excuse me, I have to go to a meeting where I'm proposing moats, drawbridges, and sharks with laser beams on their heads to SamF for data center security ... Wish me luck!

-Matthew

June 9, 2011

Postling: Tech Partner Spotlight

This is a guest blog with David Lifson from our partner Postling. Postling is an ideal social media management tool for small businesses. Postling's dashboard allows the user to take control of their online presence by aggregating all of their social media accounts in one place. David will be sharing some social media tips and tricks in a separate blog in the near future.

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace.
These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.

April 27, 2011

AppFirst: Tech Partner Spotlight

This is a guest blog from AppFirst, a SoftLayer Tech Marketplace Partner specializing in managing servers and applications with a SaaS-based monitoring solution.

How You Should Approach Monitoring in the Cloud

Monitoring in the cloud may sound like it's easy, but there's one important thing you need to know before you get started: traditional monitoring techniques simply don't work when you're in the cloud.

"But why?" you may ask. "Why can't I use Polling and Byte Code Injection in my cloud infrastructure?"

With Polling, you miss incidents that happen between intervals, you only get the data you requested, and you can only monitor parts of the application rather than the whole thing. If you choose Polling for your cloud monitoring, you'll have to live with missing data you need.

And with Byte Code Injection, you only get data from within the language runtime, which means you don't see what's really happening across your application stack; everything outside the runtime is inferred.

Using our own product on our production systems, we have learned three lessons about running in the cloud.

Lesson #1: Visibility = Control
By definition, running in the cloud means you are running in a shared environment. You don't have the CPU cycles your operating system reports you have, and sometimes the hypervisor will throttle you. In our experience, some cloud vendors are much better at managing this than others. When running in some clouds, we've seen huge variations in performance throughout the day, significantly impacting our end users' experience. One of the reasons we chose SoftLayer was that we didn't see those kinds of variances.

The reality is that until you have visibility into what your application truly needs in terms of resources, you don't have control of your application or your users' experience. According to an Aberdeen study, 68% of the time IT finds out about application issues from end users. Don't let this be you!

Lesson #2: It's Okay to Use Local Storage
The laws of physics reign, so the disk is always the slowest piece; there's no getting around the fact that physical elements like spindles and spinning platters are involved. And when you share that storage, as you do in the cloud, there can be other issues ... It all depends on the characteristics of your application. If it's serving up lots of static data, then cloud-based storage can most likely work for you. However, if you have lots of dynamic, small chunks of data, you are probably best served by local storage. That's the architecture we had to go with, given the nature of our application.

With servers around the world streaming application behavior data to our production system around the clock, and that data needing to be processed and made available in a browser, we had to use local storage. If you're interested in reading more about this and about RAM-based designs, we've covered both topics on the AppFirst blog.

Lesson #3: Know the Profile of Your Subsystems
Knowing the profile of your subsystems and what they need in terms of resources is imperative to have the best performing application. A cloud-only deployment may not be right for you; hybrid (cloud and dedicated physical servers) might work better.

As we discussed in Lesson #2, you might need local, persistent storage. Again, some vendors do this better than others; SoftLayer, in our experience, has a very good, high-bandwidth connection between their cloud and physical infrastructure. But you can't make these decisions in a vacuum. You need data to tell you which parts of your application are network heavy, which are CPU intensive, and which require a lot of memory in certain circumstances. We have learned a lot from using our own application on our production system, and it's quick and easy for you to start learning about the profile of your application too.
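If you don't yet have application-level monitoring in place, standard Linux tools will give you a rough first cut at a subsystem's profile (a sketch, not a substitute for real application monitoring): vmstat shows CPU and memory pressure, iostat shows per-device disk utilization, and sar can break out network throughput per interface (the latter two ship with the sysstat package):

# vmstat 5
# iostat -x 5
# sar -n DEV 5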

We are constantly learning more about deploying in the cloud, NoSQL databases, scalable architectures, and more. Check out the AppFirst blog regularly for the latest.

We'd like to give a special shout-out to SoftLayer! We're honored to be one of your launch partners in the new Technology Partners Marketplace.

-AppFirst

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace.
These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.

October 18, 2010

Another Day. Another Product.

Today, SoftLayer released an Advanced Monitoring Solution based on Nimsoft's monitoring software suite. In a nutshell, the product gives SoftLayer visibility into a customer's server at the OS level. In addition to the benefits customers receive from the product itself, it gives our sales and support staff better tools to troubleshoot, diagnose and design systems.

The core product works through a piece of software called a robot that gets installed on a customer's server. The robot, in turn, allows probes to be run on the server, and the different probes collect various data points from the OS and applications. As the probes collect data, they pass the information on to intermediate backend servers, and it eventually ends up in our brand new HBase data warehouse (HBase is a massively scalable database built for very large amounts - petabytes - of data). This is the cornerstone of the offering's scalability. So the robot is the main piece of software, and the probes are the application watchers that run inside the robot framework.

There are additional features outside of the process described above called 'Offbox Monitors' or 'Offbox Probes.' These are probes that live on servers in the SoftLayer data center. The idea behind them is to let customers monitor network services from a remote location. An example would be url_response, which monitors whether a URL is active and passing data (along with some other data points people might be interested in, like response time).

What can it monitor? The better question might be: what can't it monitor? The SoftLayer offering comes in three packages - Basic, Advanced and Premium. Basic is a free package that monitors core hardware (CPU, memory, disk) along with simple processes and services. Advanced moves into deeper system monitoring, and Premium adds more application monitoring (databases, web services, etc.). The offering is available for hardware, monthly CCIs and hourly CCIs - basically everything we sell. Customers can order the software from all order forms (external, internal, CCI, server, etc.) as well as add the service post-deployment from the advanced monitoring portal page.

The service offering has two distinctive reporting features that we call graphing and alarms. Graphing allows customers to (yep, you guessed it) graph all the data we collect. For example, we can show a graph of CPU usage over time. Alarms are notifications that a service is outside of a predetermined range. For example, you could set up an alarm for when CPU usage goes above 90%. Alarms can be tracked from the customer portal, or email alerts (SoftLayer calls this list 'Alarm Subscribers') can be set up by the customer.

All the features of the product are accessed via the customer portal or via SoftLayer's internal portal. Configuration, graphing and alarm management can all be done from one management page in the customer portal. In near real time, customers can change configurations directly on their server or cloud computing instance (CCI) for the various data points they want to graph and alarm on. It's pretty slick, and it adds to the SoftLayer secret sauce. We have also added a feature that allows the customer to save the configuration of a particular server and "redeploy" it on different servers or future servers they may add. This feature makes it easier to scale and customize for a particular customer's needs.

As time goes on, we will continue to add more probes and more features. This is just the beginning - and make no mistake, it's pretty damn cool.

-@quigleymar
