Article Blog Image

System Call Tracing

Tools

I want to introduce one of the most powerful techniques in our arsenal when supporting production systems: system call tracing. But first: what is a system call?

Simply put, system calls are how programs interact with the operating system to request and manage resources like memory, files, network sockets, and hardware devices.

System call tracing allows you to observe the behavior of running processes and how they use those resources in real time.

Why is such a technique useful?

  • It allows you to learn the behavior of running processes that don’t have source code available
  • It helps you diagnose performance issues of running processes when observability like logs or time series isn’t effective

Note: tracing tools introduce a performance penalty on processes when traced. Be mindful when running tools like this on a running production system where application performance is really important.

Let’s demo some of the tools available! The examples below come from Ubuntu 22.04 LTS so package names may be different on your system.

strace

The most well-known tool is called strace. It should already be installed on Ubuntu systems by default, but can be installed via:

$ sudo apt install strace

This tool is used in one of two ways:

  • execute and trace a new instance of a process;
  • attach to and trace an already-executing process

Tracing a new process

To start, let’s have strace execute and trace a command for us.

The command we’ll use is cat /etc/issue, which will display the contents of the file /etc/issue, which is what Linux systems use to print the console login prompt.

$ cat /etc/issue
Ubuntu 22.04.2 LTS \n \l

Now, let’s have strace run it for us and display all of the syscalls:

$ strace cat /etc/issue
execve("/usr/bin/cat", ["cat", "/etc/issue"], 0x7fffdac80408 /* 49 vars */) = 0
brk(NULL)                           	= 0x557410a5c000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffd484d9db0) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa6f5080000
access("/etc/ld.so.preload", R_OK)  	= -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=104939, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 104939, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa6f5066000
close(3)                            	= 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\237\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0i8\235HZ\227\223\333\350s\360\352,\223\340."..., 68, 896) = 68
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=2216304, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2260560, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa6f4e00000
mmap(0x7fa6f4e28000, 1658880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x7fa6f4e28000
mmap(0x7fa6f4fbd000, 360448, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bd000) = 0x7fa6f4fbd000
mmap(0x7fa6f5015000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x214000) = 0x7fa6f5015000
mmap(0x7fa6f501b000, 52816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa6f501b000
close(3)                            	= 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa6f5063000
arch_prctl(ARCH_SET_FS, 0x7fa6f5063740) = 0
set_tid_address(0x7fa6f5063a10)     	= 1156918
set_robust_list(0x7fa6f5063a20, 24) 	= 0
rseq(0x7fa6f50640e0, 0x20, 0, 0x53053053) = 0
mprotect(0x7fa6f5015000, 16384, PROT_READ) = 0
mprotect(0x55740ec42000, 4096, PROT_READ) = 0
mprotect(0x7fa6f50ba000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7fa6f5066000, 104939)      	= 0
getrandom("\x57\x39\x16\x9e\xd9\xa5\x61\x56", 8, GRND_NONBLOCK) = 8
brk(NULL)                           	= 0x557410a5c000
brk(0x557410a7d000)                 	= 0x557410a7d000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=14575936, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 14575936, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa6f4000000
close(3)                            	= 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "/etc/issue", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=26, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa6f5041000
read(3, "Ubuntu 22.04.2 LTS \\n \\l\n\n", 131072) = 26
write(1, "Ubuntu 22.04.2 LTS \\n \\l\n\n", 26Ubuntu 22.04.2 LTS \n \l
) = 26
read(3, "", 131072)                 	= 0
munmap(0x7fa6f5041000, 139264)      	= 0
close(3)                            	= 0
close(1)                            	= 0
close(2)                            	= 0
exit_group(0)                       	= ?
+++ exited with 0 +++

Each line of this output has the name of the system call, the arguments passed, and the value that was returned from its execution.

Let’s inspect the first line of output from the command:

execve("/usr/bin/cat", ["cat", "/etc/issue"], 0x7fffdac80408 /* 49 vars */) = 0

The system call name is called execve(). What does it do? We can find out by using man, which is the manual command for Unix/Linux systems. We tend to use it to read the reference on command line tools, but section 2 contains all of the system calls!

First, let’s make sure the manpages are installed with:

$ sudo apt-install manpages-posix manpages-dev

And now, let’s look up execve() in the section 2 of the manpages:

$ man 2 execve

From this reference, we learn that execve() is a syscall used to execute a program. The first argument is the path to the binary executed, which is /usr/bin/cat, with the argument list cat /etc/issue.

(To list all system calls, run man 2 syscall.)

Let’s go toward the bottom of the output of strace and look at the most interesting parts of the execution:

openat(AT_FDCWD, "/etc/issue", O_RDONLY) = 3
…
read(3, "Ubuntu 22.04.2 LTS \\n \\l\n\n", 131072) = 26
write(1, "Ubuntu 22.04.2 LTS \\n \\l\n\n", 26Ubuntu 22.04.2 LTS \n \l
) = 26
read(3, "", 131072)                 	= 0
…
close(3)                            	= 0
close(1)                            	= 0
close(2)                            	= 0
exit_group(0)                           = ?

  • Syscall openat(), from the current working directory, requested to open the file /etc/issue in read-only mode. It returned successfully with 3, which is a file descriptor(fd for short). Simply, file descriptors are how programs keep track of open files and other resources like network connections!
  • Syscall read() then reads the contents of /etc/issue into a buffer using fd 3, and returns the size of the data read (26 bytes).
  • Syscall write() writes that same data from the buffer into fd 1, which is always used for standard output (stdout). Standard output is how output data is displayed in the terminal. Notice that the output from strace included the actual output from the cat command.
  • Syscall read() attempts to read additional data from fd 3, and gets an empty string.
  • Syscall close() is called multiple times to close fd 3 for /etc/issue, fd 2 for standard error(stderr), and fd 1 for standard output(stdout). Stderr is used by programs to write error messages to the terminal output. This allows users to treat command output and error messages separately.
  • Syscall exit_group() ends execution for the current executing context as well as any running threads.

Attaching to a running process

Now that you can trace a process directly executed from the command line, let’s explore tracing an already-running process!

Let’s pretend that we have a process called exampledaemon.py that we are investigating that is confirmed to be slow.

First, we will need the process id (PID) of the currently running instance of exampledaemon.py. We can find that with:

$ pgrep --full exampledaemon
2644897

pgrep simply displays all of the PIDs that match the pattern provided. We get PID 2644897, so now let’s trace it:

$ sudo strace -f -p 2644897
strace: Process 2644897 attached
read(3,

We see that syscall read() is being executed against fd 3 but has not yet been completed. This is a clue of what is slowing down execution, but what does fd 3 correspond to?

Time to introduce a new tool: lsof, which stands for ‘list open files’. Using the PID of exampledaemon.py, we can dump all of the file descriptors currently open for that process:

$ lsof -p 2644897
COMMAND 	PID USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
python3 2644897 amin  cwd	DIR  	8,0 	4096 275513 /home/amin
python3 2644897 amin  rtd	DIR  	8,0 	4096  	2 /
python3 2644897 amin  txt	REG  	8,0  5494584   1794 /usr/bin/python3.8
python3 2644897 amin  mem	REG  	8,0   598104  10981 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
python3 2644897 amin  mem	REG  	8,0   186344  13492 /usr/lib/python3.8/lib-dynload/_ssl.cpython-38-x86_64-linux-gnu.so
python3 2644897 amin  mem	REG  	8,0   224008   6738 /usr/lib/x86_64-linux-gnu/libmpdec.so.2.4.2
python3 2644897 amin  mem	REG  	8,0   180568  20690 /usr/lib/python3.8/lib-dynload/_decimal.cpython-38-x86_64-linux-gnu.so
python3 2644897 amin  mem	REG  	8,0  2954080  10980 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
python3 2644897 amin  mem	REG  	8,0	41544  13490 /usr/lib/python3.8/lib-dynload/_hashlib.cpython-38-x86_64-linux-gnu.so
python3 2644897 amin  mem	REG  	8,0	51856  11043 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
python3 2644897 amin  mem	REG  	8,0  3035952  39505 /usr/lib/locale/locale-archive
python3 2644897 amin  mem	REG  	8,0   108936  13239 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
python3 2644897 amin  mem	REG  	8,0   182560	864 /usr/lib/x86_64-linux-gnu/libexpat.so.1.6.11
python3 2644897 amin  mem	REG  	8,0  1369384   7149 /usr/lib/x86_64-linux-gnu/libm-2.31.so
python3 2644897 amin  mem	REG  	8,0	14880  11052 /usr/lib/x86_64-linux-gnu/libutil-2.31.so
python3 2644897 amin  mem	REG  	8,0	18848   7148 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
python3 2644897 amin  mem	REG  	8,0   157224  11048 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
python3 2644897 amin  mem	REG  	8,0  2029592   7147 /usr/lib/x86_64-linux-gnu/libc-2.31.so
python3 2644897 amin  mem	REG  	8,0	27002  11302 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
python3 2644897 amin  mem	REG  	8,0   191504   7143 /usr/lib/x86_64-linux-gnu/ld-2.31.so
python3 2644897 amin	0u   CHR	136,0  	0t0  	3 /dev/pts/0
python3 2644897 amin	1u   CHR	136,0  	0t0  	3 /dev/pts/0
python3 2644897 amin	2u   CHR	136,0  	0t0  	3 /dev/pts/0
python3 2644897 amin	3u  IPv4 40640597  	0t0	TCP localhost:33254->localhost:mysql (ESTABLISHED)

On the bottom of this output you can see the entry for fd 3, which corresponds to a local TCP connection using port 33254 to a MySQL database server:

python3 2644897 amin	3u  IPv4 40640597  	0t0	TCP localhost:33254->localhost:mysql (ESTABLISHED)

We can then use that port number and interrogate MySQL for all running connections to see what statement is running:

mysql> SHOW FULL PROCESSLIST\G

     Id: 204
   User: suitecrm
   Host: localhost:33254
     db: suitecrm
Command: Query
   Time: 21
  State: User sleep
   Info: SELECT SLEEP(180)

Apparently, a developer accidentally left a SLEEP() statement in the code which caused statement execution to stop for three minutes.

Of course, this is a rather simplistic example, but it does show how strace and lsof can be used together to help understand a process’s behavior.

perf trace

Finally, the perf command line tool provided by the linux-tools package includes a system call tracer similar to strace. It’s worth investigating as the performance impact can be significantly less than strace. You can invoke it similarly via:

  • perf trace COMMAND
  • perf trace -p PID

Its output is a little different, and automatically displays the duration of each syscall’s execution. (For strace, use the -r flag.)

5.924 ( 0.014 ms): cat/2646005 read(fd: 3</etc/issue>, buf: 0x7fdef6a92000, count: 131072)       	= 26

Conclusion

System call tracing allows you to observe the behavior of your applications in real-time when your observability isn’t sufficient for troubleshooting. Note that this comes with a performance impact on the processes traced so use tools with caution on production systems. They also require some Linux systems knowledge to interpret, however, manpages are available to help explain the output of these tools.

I have supported production systems since 2005 and have used tools like this numerous times to help cast light on what’s happening during a production incident. If your team needs assistance on leveling up in observability and troubleshooting skills, please schedule a call with me!

(Image Credit: Mati Mango)

Tags: