Posts Tagged 'Errors'

September 16, 2013

Sysadmin Tips and Tricks - Using strace to Monitor System Calls

Linux admins often encounter rogue processes that die without explanation, go haywire without any meaningful log data or fail in other interesting ways without providing useful information that can help troubleshoot the problem. Have you ever wished you could just see what the program is trying to do behind the scenes? Well, you can — strace (system trace) is very often the answer. It is included with most distros' package managers, and the syntax should be pretty much identical on any Linux platform.

First, let's get rid of a misconception: strace is not a "debugger," and it isn't a programmer's tool. It's a system administrator's tool for monitoring system calls and signals. It doesn't involve any sophisticated configurations, and you don't have to learn any new commands ... In fact, the most common uses of strace involve the bash commands you learned the early on:

  • read
  • write
  • open
  • close
  • stat
  • fork
  • execute (execve)
  • chmod
  • chown

 

You simply "attach" strace to the process, and it will display all the system calls and signals resulting from that process. Instead of executing the command's built-in logic, strace just makes the process's normal calls to the system and returns the results of the command with any errors it encountered. And that's where the magic lies.

Let's look an example to show that behavior in action. First, become root — you'll need to be root for strace to function properly. Second, make a simple text file called 'test.txt' with these two lines in it:

# cat test.txt
Hi I'm a text file
there are only these two lines in me.

Now, let's execute the cat again via strace:

$ strace cat test.txt 
execve("/bin/cat", ["cat", "test.txt"], [/* 22 vars */]) = 0
brk(0)  = 0x9b7b000
uname({sys="Linux", node="ip-208-109-127-49.ip.secureserver.net", ...}) = 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=30671, ...}) = 0
mmap2(NULL, 30671, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f35000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000_\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1594552, ...}) = 0
mmap2(NULL, 1320356, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7df2000
mmap2(0xb7f2f000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13c) = 0xb7f2f000
mmap2(0xb7f32000, 9636, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7f32000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7df1000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7df0000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7df1b80, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7f2f000, 8192, PROT_READ) = 0
mprotect(0xb7f57000, 4096, PROT_READ) = 0
munmap(0xb7f35000, 30671) = 0
brk(0)  = 0x9b7b000
brk(0x9b9c000) = 0x9b9c000
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
open("test.txt", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=57, ...}) = 0
read(3, "Hi I'm a text file\nthere are onl"..., 4096) = 57
write(1, "Hi I'm a text file\nthere are onl"..., 57Hi I’m a text file
there are only these two lines in me.
) = 57
read(3, "", 4096) = 0
close(3) = 0
close(1) = 0
exit_group(0) = ?

Now that return may look really arcane, but if you study it a little bit, you'll see that it includes lots of information that even an ordinary admin can easily understand. The first line returned includes the execve system call where we'd execute /bin/cat with the parameter of test.txt. After that, you'll see the cat binary attempt to open some system libraries, and the brk and mmap2 calls to allocate memory. That stuff isn't usually particularly useful in the context we're working in here, but it's important to understand what's going on. What we're most interested in are often open calls:

open("test.txt", O_RDONLY|O_LARGEFILE) = 3

It looks like when we run cat test.txt, it will be opening "test.txt", doesn't it? In this situation, that information is not very surprising, but imagine if you are in a situation were you don't know what files a given file is trying to open ... strace immediately makes life easier. In this particular example, you'll see that "= 3" at the end, which is a temporary sort of "handle" for this particular file within the strace output. If you see a "read" call with '3' as the first parameter after this, you know it's reading from that file:

read(3, "Hi I'm a text file\nthere are onl"..., 4096) = 57

Pretty interesting, huh? strace defaults to just showing the first 32 or so characters in a read, but it also lets us know that there are 57 characters (including special characters) in the file! After the text is read into memory, we see it writing it to the screen, and delivering the actual output of the text file. Now that's a relatively simplified example, but it helps us understand what's going on behind the scenes.

Real World Example: Finding Log Files

Let's look at a real world example where we'll use strace for a specific purpose: You can't figure out where your Apache logs are being written, and you're too lazy to read the config file (or perhaps you can't find it). Wouldn't it be nice to follow everything Apache is doing when it starts up, including opening all its log files? Well you can:

strace -Ff -o output.txt -e open /etc/init.d/httpd restart

We are executing strace and telling it to follow all forks (-Ff), but this time we'll output to a file (-o output.txt) and only look for 'open' system calls to keep some of the chaff out of the output (-e open), and execute '/etc/init.d/httpd restart'. This will create a file called "output.txt" which we can use to find references to our log files:

#cat output.txt | grep log
[pid 13595] open("/etc/httpd/modules/mod_log_config.so", O_RDONLY) = 4
[pid 13595] open("/etc/httpd/modules/mod_logio.so", O_RDONLY) = 4
[pid 13595] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
[pid 13595] open("/etc/httpd/logs/ssl_error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 11
[pid 13595] open("/etc/httpd/logs/access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 12
[pid 13595] open("/etc/httpd/logs/cm4msaa7.com", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 13
[pid 13595] open("/etc/httpd/logs/ssl_access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 14
[pid 13595] open("/etc/httpd/logs/ssl_request_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 15
[pid 13595] open("/etc/httpd/modules/mod_log_config.so", O_RDONLY) = 9
[pid 13595] open("/etc/httpd/modules/mod_logio.so", O_RDONLY) = 9
[pid 13596] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 10
[pid 13596] open("/etc/httpd/logs/ssl_error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 11
open("/etc/httpd/logs/access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 12
open("/etc/httpd/logs/cm4msaa7.com", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 13
open("/etc/httpd/logs/ssl_access_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 14
open("/etc/httpd/logs/ssl_request_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 15

The log files jump out at you don't they? Because we know that Apache will want to open its log files when it starts, all we have to do is we follow all the system calls it makes when it starts, and we'll find all of those files. Easy, right?

Real World Example: Locating Errors and Failures

Another valuable use of strace involves looking for errors. If a program fails when it makes a system call, you'll want to be able pinpoint any errors that might have caused that failure as you troubleshoot. In all cases where a system call fails, strace will return a line with "= -1" in the output, followed by an explanation. Note: The space before -1 is very important, and you'll see why in a moment.

For this example, let's say Apache isn't starting for some reason, and the logs aren't telling ua anything about why. Let's run strace:

strace -Ff -o output.txt -e open /etc/init.d/httpd start

Apache will attempt to restart, and when it fails, we can grep our output.txt for '= -1' to see any system calls that failed:

$ cat output.txt | grep '= -1'
[pid 13748] open("/etc/selinux/config", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/i686/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/i686/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/tls/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/i686/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/i686/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/sse2/libperl.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libnsl.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libutil.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/etc/gai.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 13748] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EACCES (Permission denied)

With experience, you'll come to understand which errors matter and which ones don't. Most often, the last error is the most significant. The first few lines show the program trying different libraries to see if they are available, so they don't really matter to us in our pursuit of what's going wrong with our Apache restart, so we scan down and find that the last line:

[pid 13748] open("/etc/httpd/logs/error_log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EACCES (Permission denied)

After a little investigation on that file, I see that some maniac as set the immutable attribute:

lsattr /etc/httpd/logs/error_log
----i-------- /etc/httpd/logs/error_log

Our error couldn't be found in the log file because Apache couldn't open it! You can imagine how long it might take to figure out this particular problem without strace, but with this useful tool, the cause can be found in minutes.

Go and Try It!

All major Linux distros have strace available — just type strace at the command line for the basic usage. If the command is not found, install it via your distribution's package manager. Get in there and try it yourself!

For a fun first exercise, bring up a text editor in one terminal, then strace the editor process in another with the -p flag (strace -p <process_id>) since we want to look at an already-running process. When you go back and type in the text editor, the system calls will be shown in strace as you type ... You see what's happening in real time!

-Lee

July 25, 2008

Thinkin' Like a Programmer

"I can't figure this out. My email client says I can't attach more than 10 M of data, but then it says I have 16501 K of data attached, and it can't send that. What's a 'M'? What's a 'K'? Why is the second number so big? I only attached a few files!"

I explained to my uncle that "M" stood for Megabyte, and "K" stood for Kilobyte. That a simple calculation to convert "K"s to "M"s was to take the last three digits off the "K"s and you had the size in "M"s, give or take one. That he had 16-17 Megabytes of data attached to his email, and he can only have 10.

His response was to wonder (1) why didn't the client just tell him he had "too much data," and (2) why did the program give him Ms AND Ks, instead of picking one?

My reply consisted of (1) it did, that's what the message said, and (2) because the programmer was thinking like a programmer.

See, my uncle is a very, very smart man. He worked in a video arcade as the guy who rewired the arcade machines when they exploded when somebody poured a Coke on them. He knew how the machines worked in and out. And got paid good money. When he moved back to Texas, he took up industrial and residential electric work, and is now a fully licensed foreman who's in high demand all through the area. When he says "I won't take a job that pays less than $20 a hour," it's not because he's picky, it's because he doesn't have to. Sharp as a tack. But he's not a computer pro. Not a problem, people can't be pros at everything. This ain't the 1700s, where you can pick up a test tube and learn everything known about chemistry in a few days.

But why would a programmer write a error message for an email program that would be unreadable to end users? Because it's perfectly readable to him! When my uncle read out the message, my first response was "You have about 7 Megabytes too many attachments. Send a second email."

Therefore, a programmer checking his work would think this was a great error message. Not only does it tell you that the email can't be sent, but it tells you why. The limit is in Megabytes, but email messages are typically sent in Kilobytes, so the data is already there. See how helpful I am! And the unit conversion between Ks and Ms are very easy; programmers do it 10 times a day and wouldn't even notice it.

That's why we have end user testing, to try to catch these things that programmers won't notice. It's just a simple conversion of units! But for an electrician trying to send an email, it was as opaque as to him as if he had told me that I had a single pole dual throw make-break when I need a dual pole single throw break-make. It makes perfect sense, if you're used to it. And if I think about it for a minute, I could figure it out most likely… but the point is, his error message is useless to me as it's formatted. But it makes perfect sense to him.

So, what's the moral of the story? Well, moral 1 is, try to be sure that all users of your product can understand what you say. We have an extensive testing process here at SoftLayer to make sure our data screens are usable without any confusion. Moral 2 is that programmers don't "actively" attempt to "keep people from using their computers" by "making their programs too complex." For us, it's completely transparent and useful, as useful to us as a circuit diagram is to an electrician. Just let us know if we make something a bit to opaque; it wasn't on purpose, and sometimes it's an easy fix. We were just thinking like programmers.

-Shawn

Categories: 
May 14, 2008

On Site Development

I have a friend who worked at an internationally recognized fast food restaurant. It relies heavily on it’s computer systems to operate. Being a programmer (I’ve always preferred Computer Alchemist), I’ve always been fascinated with the command and control programs used at these restaurants, and have drilled many family members and friends to describe (in the most non-specific non-job-endangering way) what they do with those computers to do their job.

At the fast food restaurant, every order is entered into the cash register at the front of the line, processed by the computer systems on site, the orders are relayed to the kitchen in realtime and displayed on monitors, and various thermal ticket printers spit out meal tickets to be affixed to the various food items to describe their state (Double Cheeseburger, No Pickles, No Onions, Add Secret Sauce). It’s an amazing dance of software.

But I noticed that my friend the grill cook was constantly complaining about the system. Apparently, the order processing is really real time. When the guy up front presses the “double cheeseburger” button, it immediately lights up in the back but it doesn’t alert you that this is an order in progress! So, if the cooks are in a rush and just preparing orders as fast as possible, they might already have the pickles on the burger before the lady up front presses the “No Pickles” button, updating the display. Also, updated items are NOT IDENTIFIED by the system, so if the burger is already wrapped and the burger man is already moving to the next little square, it might not be caught! (Keep this in mind: most order “errors” at this internationally recognized fast food restaurant are caused by slow order takers, or slow customers, or customers who change their mind at the end of an order (Oh, could you make that burger with no onions?). Remember, if the guys in the back are rushing, get your order together in your head before you speak. If you say “Double Cheeseburger, no pickles, no onions, add secret sauce” just like that, it’ll pop into the system the right way!)

My question, of course, was “Why not rewrite the cash registers to be more awesome?” See, the old registers tied into a small computer in the back room that tied into the monitors in the kitchen. All state was held on that one machine, so obviously as soon as it was updated, it would appear on the monitors. But I couldn’t figure out why they kept using this stupid backwards system. It’s 2008! When I worked in the local greasepit in my small town, we used paper tickets we were more efficient than this fast food restaurant’s system! So why haven’t they updated? Simple: the people who can fix it aren’t there to use it.

Aha! That’s what makes SoftLayer so awesome. See, our programming staff is right here, in the same office as everyone else (except for those people who work in our datacenters, but they have our internal IM, email, and phone numbers and know they can contact us at any time.). If something breaks, or if something is built in a backwards or strange way, we know immediately and can turn around and fix it. In fact, we have a whole lot of programmers, as a ratio of developers to normal people. This high ratio allows us to have the best control portal in the world, and to add features quickly. See, if only developers worked at that fast food restaurant, and saw how the system was used, and were allowed to make changes (another awesome bit about SoftLayer: management is open to change. You’d be surprised how many obvious changes simply are not allowed by management in some dev shops…), then it would be a much easier place for everyone involved. And maybe I’d get my Double Cheeseburger right. Without the onions.

-Shawn

Categories: 
Subscribe to errors