Seccomp (short for Secure Computing mode) is a security facility in the Linux kernel. It was merged into the mainline kernel in version 2.6.12, released in 2005. Seccomp allows a process to make a one-way transition into a “secure” state in which it can no longer make certain system calls. If it attempts one, the kernel terminates the process: with SIGKILL in the original strict mode, or with SIGSYS under a filter’s kill action.
Seccomp-BPF, added in 2012 (Linux 3.5), extends seccomp with syscall filters expressed as BPF programs.
It is used in many sandbox-like applications (e.g. Chrome/Chromium, Firefox, Docker, QEMU, Android, systemd, OpenSSH…) for isolation purposes.
Basic example
Question: how do we block specific syscalls?
First, we need the libseccomp header files. Install the development package:
apt install libseccomp-dev
The following code (function filter_syscalls()) shows a typical use of seccomp. It blocks the fchmodat and symlinkat syscalls, and also blocks the write syscall when its count argument exceeds 2048.
Note: compile the code above with the -lseccomp flag and run it to get our secured shell.
Then try it out at the exec’ed bash prompt:
brooke@VM-250-12-ubuntu:~/seccomp_demo$ gcc seccomp_basic.c -l seccomp && ./a.out
[DEBUG]seccomp_basic.c: 32: filtering syscalls...
brooke@VM-250-12-ubuntu:~/seccomp_demo$ chmod -x a.out # test fchmodat
Bad system call (core dumped)
brooke@VM-250-12-ubuntu:~/seccomp_demo$ ln -s a.out # test symlinkat
Bad system call (core dumped)
brooke@VM-250-12-ubuntu:~/seccomp_demo$ echo "hello" # test write
hello
brooke@VM-250-12-ubuntu:~/seccomp_demo$ cat seccomp_basic.c # test write
Bad system call (core dumped)
brooke@VM-250-12-ubuntu:~/seccomp_demo$ cat /proc/$$/status
...
NoNewPrivs: 1 # this process and its children cannot gain new privileges
Seccomp: 2 # Seccomp filter mode
...
brooke@VM-250-12-ubuntu:~/seccomp_demo$ sudo ls
sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?
brooke@VM-250-12-ubuntu:~/seccomp_demo$ exit # Don't forget to quit bash
As expected, any process (including subprocesses) that invokes a filtered syscall receives SIGSYS and dumps core.
Export filter’s bpf
Underneath, seccomp performs filtering with BPF, which we’ll explain later. libseccomp provides useful functions to generate and output the corresponding BPF as well as PFC (Pseudo Filter Code), so we can take a closer look.
For a trivial case, we filter only the fchmodat syscall and export the BPF:
It seems quite straightforward. There’s also an awesome tool, seccomp-tools, which can disassemble the seccomp_filter.bpf above:
line CODE JT JF K
0000: 0x20 0x00 0x00 0x00000004 A = arch
0001: 0x15 0x00 0x05 0xc000003e if (A != ARCH_X86_64) goto 0007
0002: 0x20 0x00 0x00 0x00000000 A = sys_number
0003: 0x35 0x00 0x01 0x40000000 if (A < 0x40000000) goto 0005
0004: 0x15 0x00 0x02 0xffffffff if (A != 0xffffffff) goto 0007
0005: 0x15 0x01 0x00 0x0000010c if (A == fchmodat) goto 0007
0006: 0x06 0x00 0x00 0x7fff0000 return ALLOW
0007: 0x06 0x00 0x00 0x00000000 return KILL
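To make the control flow of this listing concrete, here is a small user-space evaluator for it. It is only a sketch: the names demo_filter, eval_filter() and classify() are made up for this example, and it handles just the four opcodes that appear in the listing, not the full cBPF instruction set.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>    /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>   /* struct sock_filter, BPF_* */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */

/* The eight instructions from the disassembly: {code, jt, jf, k}. */
static const struct sock_filter demo_filter[] = {
    {0x20, 0x00, 0x00, 0x00000004},
    {0x15, 0x00, 0x05, 0xc000003e},
    {0x20, 0x00, 0x00, 0x00000000},
    {0x35, 0x00, 0x01, 0x40000000},
    {0x15, 0x00, 0x02, 0xffffffff},
    {0x15, 0x01, 0x00, 0x0000010c},
    {0x06, 0x00, 0x00, 0x7fff0000},
    {0x06, 0x00, 0x00, 0x00000000},
};

/* Run the filter against one syscall description and return its verdict.
 * Anything this toy evaluator does not understand is treated as KILL. */
uint32_t eval_filter(const struct sock_filter *prog, size_t len,
                     const struct seccomp_data *data)
{
    uint32_t a = 0;  /* accumulator register A */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *ins = &prog[pc];
        switch (ins->code) {
        case BPF_LD | BPF_W | BPF_ABS:   /* A = 32-bit word at offset k */
            if ((size_t)ins->k + sizeof(a) > sizeof(*data))
                return SECCOMP_RET_KILL;
            memcpy(&a, (const char *)data + ins->k, sizeof(a));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:  /* jumps are relative to pc + 1 */
            pc += (a == ins->k) ? ins->jt : ins->jf;
            break;
        case BPF_JMP | BPF_JGE | BPF_K:
            pc += (a >= ins->k) ? ins->jt : ins->jf;
            break;
        case BPF_RET | BPF_K:
            return ins->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;  /* fell off the end of the program */
}

/* Convenience wrapper: verdict for syscall number nr on a given arch. */
uint32_t classify(int nr, uint32_t arch)
{
    struct seccomp_data d = { .nr = nr, .arch = arch };
    return eval_filter(demo_filter,
                       sizeof(demo_filter) / sizeof(demo_filter[0]), &d);
}
```

With this, classify(268, AUDIT_ARCH_X86_64) follows 0000, 0001, 0002, 0003, 0005 and lands on 0007 (KILL), while any other native x86_64 syscall number ends at 0006 (ALLOW). A foreign architecture value jumps straight from 0001 to 0007.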
Seccomp-BPF
Seccomp-BPF is just an extension of cBPF (classic Berkeley Packet Filter; note: not eBPF).
The tiny BPF program runs on a dedicated VM in the kernel, with very few registers and a reduced instruction set.
BPF code definitions in /usr/include/linux/filter.h:
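As a rough illustration of how those definitions fit together, the sketch below rebuilds the eight-instruction filter from the earlier disassembly with the BPF_STMT and BPF_JUMP helper macros, and checks at compile time that the symbolic opcodes match the raw hex values. The array name readable_filter is made up for this example, and 268 is fchmodat on x86_64 only.

```c
#include <stddef.h>
#include <linux/audit.h>    /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */

/* Each raw opcode is an OR of instruction class, size/operation and mode. */
_Static_assert((BPF_LD | BPF_W | BPF_ABS) == 0x20, "load word at offset k");
_Static_assert((BPF_JMP | BPF_JEQ | BPF_K) == 0x15, "jump if A == k");
_Static_assert((BPF_JMP | BPF_JGE | BPF_K) == 0x35, "jump if A >= k");
_Static_assert((BPF_RET | BPF_K) == 0x06, "return constant k");

/* The k operands of the two loads are offsets into struct seccomp_data. */
_Static_assert(offsetof(struct seccomp_data, nr) == 0, "sys_number");
_Static_assert(offsetof(struct seccomp_data, arch) == 4, "arch");
_Static_assert(AUDIT_ARCH_X86_64 == 0xc000003e, "ARCH_X86_64");
_Static_assert(SECCOMP_RET_ALLOW == 0x7fff0000, "return ALLOW");

/* BPF_STMT(code, k) and BPF_JUMP(code, k, jt, jf) expand to one
 * struct sock_filter initializer each. */
static const struct sock_filter readable_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 5),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x40000000, 0, 1),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0xffffffff, 0, 2),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 268 /* fchmodat, x86_64 */, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
```

Writing filters this way keeps the bytes identical while making the intent of each instruction readable.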
We can, of course, apply seccomp-BPF binary code directly with prctl() (or with the newer seccomp() syscall) to gain more fine-grained control over our BPF. But in most cases the libseccomp wrappers, like seccomp_rule_add(), just work.
The binary code below is the same as the hexdumped file for filtering fchmodat.
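#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog */
#include <linux/seccomp.h>  /* SECCOMP_MODE_FILTER */
#include <stdlib.h>         /* EXIT_FAILURE */
#include <sys/prctl.h>      /* prctl(), PR_SET_* */

#ifndef ARRAY_SIZE
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#endif

/* log_debug() and log_error() are the demo's own logging helpers. */
int filter_syscalls() {
    int ret = -1;
    log_debug("filtering syscalls with bpf...");
    struct sock_filter code[] = {
        /* op,   jt,   jf,   k */
        {0x20, 0x00, 0x00, 0x00000004},
        {0x15, 0x00, 0x05, 0xc000003e},
        {0x20, 0x00, 0x00, 0x00000000},
        {0x35, 0x00, 0x01, 0x40000000},
        {0x15, 0x00, 0x02, 0xffffffff},
        {0x15, 0x01, 0x00, 0x0000010c}, // 268 fchmodat
        {0x06, 0x00, 0x00, 0x7fff0000},
        {0x06, 0x00, 0x00, 0x00000000},
    };
    struct sock_fprog bpf = {
        .len = ARRAY_SIZE(code),
        .filter = code,
    };
    /* An unprivileged process must set no_new_privs before adding a filter. */
    ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    if (ret < 0) { log_error("error prctl set no new privs"); return EXIT_FAILURE; }
    ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf);
    if (ret < 0) { log_error("error prctl set seccomp filter"); return EXIT_FAILURE; }
    return 0;
}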
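A filter like this can also be exercised without exec'ing a shell, by installing it in a forked child and inspecting how the child dies. The sketch below does that with the same eight instructions; the function name fchmodat_is_killed() and its return convention are assumptions for this example, and the filter hard-codes x86_64 (on any other architecture the arch check kills the child at its first syscall instead).

```c
#define _GNU_SOURCE          /* for syscall() */
#include <fcntl.h>           /* AT_FDCWD */
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>     /* SYS_fchmodat */
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Returns 1 if the filtered child was killed by SIGSYS, 0 if it was not,
 * and -1 if the test could not be set up (e.g. seccomp is unavailable). */
int fchmodat_is_killed(void)
{
    struct sock_filter code[] = {
        {0x20, 0x00, 0x00, 0x00000004},
        {0x15, 0x00, 0x05, 0xc000003e},
        {0x20, 0x00, 0x00, 0x00000000},
        {0x35, 0x00, 0x01, 0x40000000},
        {0x15, 0x00, 0x02, 0xffffffff},
        {0x15, 0x01, 0x00, 0x0000010c},  /* 268: fchmodat on x86_64 */
        {0x06, 0x00, 0x00, 0x7fff0000},
        {0x06, 0x00, 0x00, 0x00000000},
    };
    struct sock_fprog prog = {
        .len = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };
    int status = 0;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: install the filter, then make the forbidden call. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) < 0)
            _exit(2);
        syscall(SYS_fchmodat, AT_FDCWD, "/nonexistent", 0644);
        _exit(0);  /* only reached if the filter let the call through */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status) == SIGSYS ? 1 : 0;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return 0;
}
```

Because the filter lives only in the child, the parent process stays unrestricted and can run the check repeatedly.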
Performance Overhead
There is no such thing as a free lunch, and seccomp-BPF is no exception. After all, it is a hook that runs every time a syscall is invoked, whatever the syscall.
We benchmarked 3 scenarios: no filter, a filter that blocks 1 syscall, and a filter that blocks 100 syscalls (a more sophisticated BPF program).
We measured the time elapsed during 10 million write() syscalls and plotted the results as follows:
[Figure: hc_test]
As the plot shows, the overhead is around 5% to 10%, and it grows with larger BPF code.
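A measurement loop of this kind could look roughly like the sketch below. The details are assumptions rather than the setup behind the plot: it writes one byte at a time to /dev/null and times the loop with CLOCK_MONOTONIC, and the function name time_writes() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() */
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Time `iterations` one-byte write() calls to /dev/null.
 * Returns the elapsed time in nanoseconds, or -1 on error. */
int64_t time_writes(long iterations)
{
    struct timespec start, end;
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        if (write(fd, "x", 1) != 1) {
            close(fd);
            return -1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
           + (end.tv_nsec - start.tv_nsec);
}
```

To compare scenarios, call time_writes(10000000) once in a process with no filter and once after installing a filter, then divide the two results. Running each case several times and taking the median helps smooth out scheduling noise.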
Summary
In this post, we filtered syscalls with several seccomp-related facilities, inspected the seccomp-BPF code, and measured its cost.
This should be especially helpful if you’re implementing sandbox-like applications with security requirements.
Enjoy hacking!