In a typical business-flow implementation, we usually store function pointers in some collection data structure (a list, a hashmap...) and dynamically invoke them according to inputs/events/conditions, possibly from multiple threads.
But how do we do this with async functions? In dynamic languages like JavaScript or Python this is simple and straightforward. Not so in the Rust world.
As we've found, storing async functions in a Vec is challenging because Rust's async functions (and closures) each return a distinct anonymous type that implements the Future trait. These types are never the same, even when the functions share a signature. So in Rust we have to pin and box the returned Future, as well as box the whole function.
Define the Boxed Future
What we need to do first is define a type alias for a boxed future:
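In its minimal form (this alias also appears in the full listing later in this section):

use std::future::Future;
use std::pin::Pin;

// A heap-allocated, pinned, dynamically dispatched future
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;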
Pin<Box<...>>: the Pin wrapper guarantees that the Future inside the Box will not be moved in memory, which the self-referential state machines generated by async fn rely on.
dyn Future<Output = T> + Send + 'a:
dyn Future<Output = T>: A dynamically dispatched future returning a value of type T.
Send: Ensures the future can be sent across thread boundaries.
'a: Ensures that all references within the future are valid for at least as long as the lifetime 'a.
Pin and Box the async function with a wrapper
Then we need to wrap our async functions, making them return the boxed future.
Say our async functions all have the same signature:
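For instance (taken from the full listing below):

// Example async function: same signature as every other function we want to store
async fn async_fn1(v: i32) -> Result<i32, Box<dyn Error + Send>> {
    Ok(v + 1)
}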
To make our async functions callable and shareable across threads, we add Send and Sync trait bounds to the boxed function type. Notice that the Future it returns need not be Sync, since we usually don't share references to it across threads; we just consume it on the fly. As a recap: Send means ownership can safely be transferred between threads, and Sync means references can safely be shared between threads.
use std::boxed::Box;
use std::collections::VecDeque;
use std::error::Error;
use std::future::Future;
use std::pin::Pin;

// Type alias for a boxed future with a specific lifetime
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Type alias for a boxed async function that takes an i32 parameter
type BoxAsyncFn = Box<dyn Fn(i32) -> BoxFuture<'static, Result<i32, Box<dyn Error + Send>>> + Send + Sync>;

// Example async function that takes an i32 parameter and returns a Result
async fn async_fn1(v: i32) -> Result<i32, Box<dyn Error + Send>> {
    Ok(v + 1)
}

// Another example async function with the same signature
async fn async_fn2(v: i32) -> Result<i32, Box<dyn Error + Send>> {
    Ok(v * 2)
}

// Wrap an async function into the boxed form
fn box_async_fn<F, Fut>(f: F) -> BoxAsyncFn
where
    F: Fn(i32) -> Fut + Send + Sync + 'static,
    Fut: Future<Output = Result<i32, Box<dyn Error + Send>>> + Send + 'static,
{
    Box::new(move |v| Box::pin(f(v)))
}

fn main() {
    // Create a queue to store the boxed async functions
    let mut async_fns: VecDeque<BoxAsyncFn> = VecDeque::new();
    async_fns.push_back(box_async_fn(async_fn1));
    async_fns.push_back(box_async_fn(async_fn2));

    // Parameters to pass to the async functions
    let params = vec![10, 20];

    // Execute the async functions with their parameters
    for (f, &param) in async_fns.iter().zip(params.iter()) {
        match futures::executor::block_on(f(param)) {
            Ok(result) => println!("Success: {}", result),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
}
Some explanations here:
- The BoxFuture type alias simplifies the type signature of boxed futures.
- The async functions, like async_fn1, are our examples containing the business logic.
- The wrapper function box_async_fn boxes an async function into the BoxAsyncFn type.
- For storage we use a VecDeque, though any collection (Vec, HashMap...) would do; we store our boxed functions (effectively function pointers) in it.
And of course, you can use an async runtime like Tokio (or async-std) to run the functions for better performance; this basic example just demonstrates the concept.
Fear not, and happy pinning (and boxing) your async functions in Rust!
On Ubuntu 18.04 or 20.04, however, running the Shadowsocks server command ssserver leads to the following error:
2020-09-23 11:33:21 INFO loading libcrypto from libcrypto.so.1.1
Traceback (most recent call last):
File "/usr/local/bin/ssserver", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/server.py", line 34, in main
config = shell.get_config(False)
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/shell.py", line 262, in get_config
check_config(config, is_local)
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/shell.py", line 124, in check_config
encrypt.try_cipher(config['password'], config['method'])
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/encrypt.py", line 44, in try_cipher
Encryptor(key, method)
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/encrypt.py", line 82, in __init__
self.cipher = self.get_cipher(key, method, 1,
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/encrypt.py", line 109, in get_cipher
return m[2](method, key, iv, op)
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/crypto/openssl.py", line 76, in __init__
load_openssl()
File "/usr/local/lib/python3.8/dist-packages/shadowsocks/crypto/openssl.py", line 52, in load_openssl
libcrypto.EVP_CIPHER_CTX_cleanup.argtypes = (c_void_p,)
File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
func = self.__getitem__(name)
File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /lib/x86_64-linux-gnu/libcrypto.so.1.1: undefined symbol: EVP_CIPHER_CTX_cleanup
It seems that libcrypto.so lacks the symbol EVP_CIPHER_CTX_cleanup, which shadowsocks needs.
Check the symbols in libcrypto.so.1.1:
brooke@VM-101-145-ubuntu:~$ nm -gD /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 | grep EVP_CIPHER_CTX
000000000016dab0 T EVP_CIPHER_CTX_block_size
000000000016deb0 T EVP_CIPHER_CTX_buf_noconst
000000000016dae0 T EVP_CIPHER_CTX_cipher
000000000016e510 T EVP_CIPHER_CTX_clear_flags
000000000016d350 T EVP_CIPHER_CTX_copy
000000000016cd10 T EVP_CIPHER_CTX_ctrl
000000000016daf0 T EVP_CIPHER_CTX_encrypting
000000000016c160 T EVP_CIPHER_CTX_free
000000000016db10 T EVP_CIPHER_CTX_get_app_data
000000000016db30 T EVP_CIPHER_CTX_get_cipher_data
000000000016de90 T EVP_CIPHER_CTX_iv
000000000016db60 T EVP_CIPHER_CTX_iv_length
000000000016dea0 T EVP_CIPHER_CTX_iv_noconst
000000000016def0 T EVP_CIPHER_CTX_key_length
000000000016c140 T EVP_CIPHER_CTX_new
000000000016e060 T EVP_CIPHER_CTX_nid
000000000016dec0 T EVP_CIPHER_CTX_num
000000000016de80 T EVP_CIPHER_CTX_original_iv
000000000016d310 T EVP_CIPHER_CTX_rand_key
000000000016c090 T EVP_CIPHER_CTX_reset
000000000016db20 T EVP_CIPHER_CTX_set_app_data
000000000016db40 T EVP_CIPHER_CTX_set_cipher_data
000000000016e500 T EVP_CIPHER_CTX_set_flags
000000000016d2a0 T EVP_CIPHER_CTX_set_key_length
000000000016ded0 T EVP_CIPHER_CTX_set_num
000000000016cce0 T EVP_CIPHER_CTX_set_padding
000000000016e520 T EVP_CIPHER_CTX_test_flags
There is no cleanup, but there is a reset. The change happened in OpenSSL 1.1.0, as introduced here:
EVP_CIPHER_CTX was made opaque in OpenSSL 1.1.0. As a result, EVP_CIPHER_CTX_reset() appeared and EVP_CIPHER_CTX_cleanup() disappeared. EVP_CIPHER_CTX_init() remains as an alias for EVP_CIPHER_CTX_reset().
Solution
Solved by replacing both occurrences (two in total) of EVP_CIPHER_CTX_cleanup() with EVP_CIPHER_CTX_reset() in the openssl.py file, with the following command:
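The exact command isn't preserved here; a sed one-liner along these lines (the path comes from the traceback above) performs the replacement:

sed -i 's/EVP_CIPHER_CTX_cleanup/EVP_CIPHER_CTX_reset/g' \
    /usr/local/lib/python3.8/dist-packages/shadowsocks/crypto/openssl.py

After that, ssserver starts normally.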
Seccomp (short for Secure Computing mode) is a computer security facility in the Linux kernel. It was merged into the kernel mainline in version 2.6.12, released on March 8, 2005. Seccomp allows a process to make a one-way transition into a "secure" state in which it cannot make certain system calls; if it attempts to, the kernel terminates the process with SIGSYS.
Seccomp-BPF, released in 2012, provides richer syscall filtering on top of BPF.
It is used in many sandbox-like applications (i.e. Chrome/Chromium, Firefox, Docker, QEMU, Android, Systemd, OpenSSH…) for resource isolation purposes.
Basic example
Question: How to block specified syscalls?
First off, we need header files to use libseccomp2. Get the package installed:
apt install libseccomp-dev
The following code (the function filter_syscalls()) shows how seccomp is commonly used: it filters the fchmodat and symlinkat syscalls, and also blocks the write syscall if the write count argument exceeds 2048.
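The original listing isn't reproduced here; below is a minimal sketch using the libseccomp API that matches the behavior just described (the exec'ed shell and the log message are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <seccomp.h>

static int filter_syscalls(void) {
    int ret = -1;
    /* Allow every syscall by default... */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL)
        return -1;

    /* ...but kill the process (SIGSYS) on fchmodat and symlinkat... */
    ret = seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(fchmodat), 0);
    if (ret < 0) goto out;
    ret = seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(symlinkat), 0);
    if (ret < 0) goto out;

    /* ...and on write(fd, buf, count) when count (arg index 2) exceeds 2048 */
    ret = seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(write), 1,
                           SCMP_A2(SCMP_CMP_GT, 2048));
    if (ret < 0) goto out;

    ret = seccomp_load(ctx); /* also sets NO_NEW_PRIVS by default */
out:
    seccomp_release(ctx);
    return ret;
}

int main(void) {
    fprintf(stderr, "[DEBUG] filtering syscalls...\n");
    if (filter_syscalls() < 0)
        exit(EXIT_FAILURE);
    /* exec a shell so we can play inside the filtered environment */
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}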
Note: compile the above with the -lseccomp flag, and run it to get our secured shell.
Then, try it in the exec'ed bash prompt:
brooke@VM-250-12-ubuntu:~/seccomp_demo$ gcc seccomp_basic.c -l seccomp && ./a.out
[DEBUG]seccomp_basic.c: 32: filtering syscalls...
brooke@VM-250-12-ubuntu:~/seccomp_demo$ chmod -x a.out # test fchmodat
Bad system call (core dumped)
brooke@VM-250-12-ubuntu:~/seccomp_demo$ ln -s a.out # test symlinkat
Bad system call (core dumped)
brooke@VM-250-12-ubuntu:~/seccomp_demo$ echo "hello" # test write
hello
brooke@VM-250-12-ubuntu:~/seccomp_demo$ cat seccomp_basic.c # test write
Bad system call (core dumped)
brooke@VM-250-12-ubuntu:~/seccomp_demo$ cat /proc/$$/status
...
NoNewPrivs: 1 # cannot be applied to child processes with greater privileges
Seccomp: 2 # Seccomp filter mode
...
brooke@VM-250-12-ubuntu:~/seccomp_demo$ sudo ls
sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?
brooke@VM-250-12-ubuntu:~/seccomp_demo$ exit # Don't forget quit bash
As expected, any (sub)process that invokes a filtered syscall gets SIGSYS and core-dumps.
Export the filter's BPF
Underneath, seccomp performs its filtering with BPF, which we'll explain later. libseccomp provides useful functions to generate and export the corresponding BPF, as well as PFC (Pseudo Filter Code), so we can take a closer look.
For a trivial case, we filter only the fchmodat syscall, and export the BPF:
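A sketch of the export calls; seccomp_export_bpf() and seccomp_export_pfc() are the relevant libseccomp functions, and the file name matches the disassembly below:

#include <fcntl.h>
#include <unistd.h>
#include <seccomp.h>

/* Export both the raw BPF and the human-readable PFC of a filter context. */
static void export_filter(scmp_filter_ctx ctx) {
    int fd = open("seccomp_filter.bpf", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        seccomp_export_bpf(ctx, fd);        /* raw BPF, as the kernel loads it */
        close(fd);
    }
    seccomp_export_pfc(ctx, STDOUT_FILENO); /* pseudo filter code, for humans */
}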
It seems quite straightforward. And there's an awesome tool, seccomp-tools, which can disassemble the seccomp_filter.bpf above:
line CODE JT JF K
0000: 0x20 0x00 0x00 0x00000004 A = arch
0001: 0x15 0x00 0x05 0xc000003e if (A != ARCH_X86_64) goto 0007
0002: 0x20 0x00 0x00 0x00000000 A = sys_number
0003: 0x35 0x00 0x01 0x40000000 if (A < 0x40000000) goto 0005
0004: 0x15 0x00 0x02 0xffffffff if (A != 0xffffffff) goto 0007
0005: 0x15 0x01 0x00 0x0000010c if (A == fchmodat) goto 0007
0006: 0x06 0x00 0x00 0x7fff0000 return ALLOW
0007: 0x06 0x00 0x00 0x00000000 return KILL
Seccomp-BPF
Seccomp-BPF is just an extension of cBPF (classic Berkeley Packet Filter; note: not eBPF).
The tiny BPF program runs on a dedicated VM in the kernel, with rather limited registers and a reduced instruction set.
The BPF code definitions live in /usr/include/linux/filter.h:
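The relevant excerpt:

struct sock_filter {	/* Filter block */
	__u16	code;	/* Actual filter code */
	__u8	jt;	/* Jump true */
	__u8	jf;	/* Jump false */
	__u32	k;	/* Generic multiuse field */
};

struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
	unsigned short		len;	/* Number of filter blocks */
	struct sock_filter __user *filter;
};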
We can, of course, directly apply seccomp-bpf binary code with prctl(), which wraps the seccomp syscall, to gain more fine-grained control over our BPF. But in most cases, the libseccomp wrappers like seccomp_rule_add() just work.
The binary code below is the same as the file we just hexdumped for filtering fchmodat.
int filter_syscalls() {
int ret = -1;
log_debug("filtering syscalls with bpf...");
struct sock_filter code[] = {
/* op, jt, jf, k */
{0x20, 0x00, 0x00, 0x00000004},
{0x15, 0x00, 0x05, 0xc000003e},
{0x20, 0x00, 0x00, 0x00000000},
{0x35, 0x00, 0x01, 0x40000000},
{0x15, 0x00, 0x02, 0xffffffff},
{0x15, 0x01, 0x00, 0x0000010c}, // 268 fchmodat
{0x06, 0x00, 0x00, 0x7fff0000},
{0x06, 0x00, 0x00, 0x00000000},
};
struct sock_fprog bpf = {
.len = ARRAY_SIZE(code),
.filter = code,
};
ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
if (ret < 0) { log_error("error prctl set no new privs"); return EXIT_FAILURE; }
ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf);
if (ret < 0) { log_error("error prctl set seccomp filter"); return EXIT_FAILURE; }
return 0;
}
Performance Overhead
There is no such thing as a free lunch, and seccomp-bpf is no exception. After all, it is a hook program that runs every time any syscall is invoked.
We benchmarked three scenarios: no filter, a filter that blocks 1 syscall, and a filter that blocks 100 syscalls (a more sophisticated BPF program).
We measured the time elapsed during 10 million write() syscalls and plotted it as follows:
[Figure: elapsed time of 10 million write() calls under the three filter scenarios]
As it shows, the overhead is around 5%~10%, and grows with larger BPF programs.
Summary
In this post, we managed to filter syscalls with several seccomp-related facilities, inspect the seccomp-bpf code, and understand its costs.
This is especially helpful if you're implementing sandbox-like applications with security concerns.
Happy hacking!
In my previous post, http load testing with wrk2, we introduced some concepts of HTTP benchmarking. But a single client is not always enough, especially when the client is less performant than the target server. More often, system developers use multiple clients (a load testing cluster) to generate more load.
Why locust?
There are lots of open-source tools aimed at distributed load testing.
In fact, locust describes user scenarios in plain Python scripts like the sketch below, which makes it very customizable.
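A minimal locustfile sketch, using the classic pre-1.0 API that matches the --master/--slave flags below (the endpoint and wait times are illustrative):

from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):
    @task
    def index(self):
        # each simulated user repeatedly fetches the index page
        self.client.get("/")

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 1000  # milliseconds between tasks
    max_wait = 3000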
Start master
locust --host=http://TEST_TARGET_HOST:PORT --master
Start slave
locust --slave --master-host=MASTER_HOST
Finally, go to the master web UI (default port 8089), and start spawning locusts!
Before launching, we specify the expected number of users and the hatch rate.
While testing, the locust web UI provides real-time statistics, including RPS and response time.
Troubleshooting
You'll hit a "too many open files" issue if you forget to raise the limit with the ulimit command.
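For example (65535 is just a comfortably large value):

ulimit -n 65535  # raise the open files limit in the shell that launches locust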
Problems with locust
Python is not a fast language, and the problem is exacerbated in IO-intensive applications. Fortunately, locust is designed and implemented to scale out well, which ameliorates the problem a lot.
AutoScaling cluster on cloud
For deploying locust testing on the cloud, the most suitable product is a cluster management service, whether for virtual machines (AutoScaling) or for containers (a Kubernetes service). In this post, we conduct our load testing with TencentCloud AutoScaling.
First, we build our locust image, with the locust package installed and the locustfile prepared. Second, we create our testing cluster, namely an AutoScalingGroup using a LaunchConfiguration, an instance template associated with the locust image we just built. Additionally, we need to make sure the testing cluster and the target host are in the same VirtualPrivateCloud, so they can reach each other over the private network with little bandwidth cost. We also set up an external IP and security groups for the locust master node, since we'll access its port 8089 for the web UI.
Now that we've set up our load testing cluster, we can tune its expected instance number on demand.
When to add client?
Usually, as long as the total RPS still increases proportionally with the number of clients, we should consider adding one more client. In our example, 6 clients proved to be just enough.
How to inspect the chart to find the best performance?
The point comes when the QoS is no longer satisfied. As the chart above shows, the p95 latency exceeds 1s when the number of concurrent users reaches 22K, and the RPS is 2830 by then. That's it.
Conclusions
In this post, we've completed an example of distributed load testing. Although there are plenty of testing tools to choose from, locust is an easy-to-use one that helps us understand the basic concepts. Thanks to cloud-managed services like AutoScaling, we can manage our workload cluster with ease. Happy benchmarking!
An “eventfd object” can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events.
It has been in the kernel since Linux 2.6.22. The object contains an unsigned 64-bit integer (uint64_t) counter maintained by the kernel, so it's extremely fast to access.
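Creating one is a single call; the flags shown are the ones used later in this post:

#include <sys/eventfd.h>

int efd = eventfd(0 /* initial counter value */, EFD_CLOEXEC | EFD_NONBLOCK);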
That's all we need to create an eventfd file; after that, we can perform normal file operations (read/write, poll, close) on it.
Once some user-space thread writes a value greater than 0 to it, the kernel instantly notifies user space. The first thread to read it resets it (zeroes its counter), i.e. consumes the event. All later reads get an error (EAGAIN, "Resource temporarily unavailable") until it is written again (the event is re-triggered). Briefly, it turns an event into a file descriptor that can be monitored efficiently.
There are several notes we should take special account of:
Applications can use an eventfd file descriptor instead of a pipe in all cases where a pipe is used simply to signal events. The kernel overhead of an eventfd file descriptor is much lower than that of a pipe, and only one file descriptor is required (versus the two required for a pipe).
As with signal events, eventfd is much more lightweight (and thus faster) than pipes; it's just a counter in the kernel, after all.
A key point about an eventfd file descriptor is that it can be monitored just like any other file descriptor using select(2), poll(2), or epoll(7). This means that an application can simultaneously monitor the readiness of “traditional” files and the readiness of other kernel mechanisms that support the eventfd interface.
You won't wield the true power of eventfd unless you monitor it with epoll (especially EPOLLET).
So, let's get our hands dirty with a simple worker thread pool!
Worker Pool Design
We adopt the producer/consumer pattern for our worker thread pool, as it's the most common decoupling style and scales best.
By leveraging the asynchronous notification feature of eventfd, our inter-thread communication sequence can be described as follows:
Implementation
Our per-thread data structure is fairly simple; it contains only three fields: thread_id, rank (the thread index), and epfd, the epoll file descriptor created by the main function.
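A sketch of that struct, with the field names as described (the pthread handle type is assumed):

#include <pthread.h>

struct thread_info {
    pthread_t thread_id; /* pthread handle */
    int rank;            /* thread index */
    int epfd;            /* epoll fd created in main() */
};

Consumer thread routine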
static void *consumer_routine(void *data) {
    struct thread_info *c = (struct thread_info *)data;
    struct epoll_event *events;
    int epfd = c->epfd;
    int nfds = -1;
    int i = -1;
    int ret = -1;
    uint64_t v;
    int num_done = 0;

    events = calloc(MAX_EVENTS_SIZE, sizeof(struct epoll_event));
    if (events == NULL)
        exit_error("calloc epoll events\n");

    for (;;) {
        nfds = epoll_wait(epfd, events, MAX_EVENTS_SIZE, 1000);
        for (i = 0; i < nfds; i++) {
            if (events[i].events & EPOLLIN) {
                log_debug("[consumer-%d] got event from fd-%d", c->rank, events[i].data.fd);
                ret = read(events[i].data.fd, &v, sizeof(v));
                if (ret < 0) {
                    log_error("[consumer-%d] failed to read eventfd", c->rank);
                    continue;
                }
                close(events[i].data.fd);
                do_task();
                log_debug("[consumer-%d] tasks done: %d", c->rank, ++num_done);
            }
        }
    }
}
As we can see, the worker thread gets notified simply by polling the epoll fd set with epoll_wait(), read()s the eventfd to consume the event, then close()s it to clean up.
We can do anything sequential within do_task(), although for now it does nothing.
In short: poll -> read -> close.
Producer thread routine
static void *producer_routine(void *data) {
    struct thread_info *p = (struct thread_info *)data;
    struct epoll_event event;
    int epfd = p->epfd;
    int efd = -1;
    int ret = -1;
    int interval = 1;

    log_debug("[producer-%d] issues 1 task per %d second", p->rank, interval);

    while (1) {
        efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
        if (efd == -1)
            exit_error("eventfd create: %s", strerror(errno));

        event.data.fd = efd;
        event.events = EPOLLIN | EPOLLET;
        ret = epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &event);
        if (ret != 0)
            exit_error("epoll_ctl");

        ret = write(efd, &(uint64_t){1}, sizeof(uint64_t));
        if (ret != 8)
            log_error("[producer-%d] failed to write eventfd", p->rank);

        sleep(interval);
    }
}
In the producer routine, after creating the eventfd, we register it with the epoll object via epoll_ctl(). Note that the event is set for read readiness (EPOLLIN) and edge-triggered behavior (EPOLLET).
For notification, all we need to do is write 0x1 (or any non-zero value) to the eventfd.
You can adjust the thread counts to inspect the details; there's plenty of fun in it.
But now, let's try something harder. We'll smoke-test our worker pool by generating a heavy instantaneous load instead of the former regular one. We set both the producer and consumer thread counts to 1, and watch the performance.
static void *producer_routine_spike(void *data) {
    struct thread_info *p = (struct thread_info *)data;
    struct epoll_event event;
    int epfd = p->epfd;
    int efd = -1;
    int ret = -1;
    int num_task = 1000000;

    log_debug("[producer-%d] will issue %d tasks", p->rank, num_task);

    for (int i = 0; i < num_task; i++) {
        efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
        if (efd == -1)
            exit_error("eventfd create: %s", strerror(errno));

        event.data.fd = efd;
        event.events = EPOLLIN | EPOLLET;
        ret = epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &event);
        if (ret != 0)
            exit_error("epoll_ctl");

        ret = write(efd, &(uint64_t){1}, sizeof(uint64_t));
        if (ret != 8)
            log_error("[producer-%d] failed to write eventfd", p->rank);
    }
    return (void *)0;
}
Over 1 million eventfds? Indeed! Using the ulimit command below, we can raise the open files limit of the current shell, which is usually 1024 by default.
Note that you need to be root.
ulimit -n 1048576
# 1048576 is the default maximum for open files, as `/proc/sys/fs/nr_open` shows.
# To make it larger, you need to tweak kernel settings like this (which is beyond our scope):
# sysctl -w fs.nr_open=10485760
Since there is so much output on stdout, we redirect it to a log file.
On my test VM (an S2.Medium4 instance on TencentCloud, with only 2 vCPUs and 4 GB of memory), it takes less than 6.5 seconds to deal with 1 million (almost) concurrent events. The kernel-implemented counters and wait queues are quite efficient.
Conclusions
The multi-threaded programming model prevails nowadays, while the best way of scheduling (event triggering and dispatching) is still under discussion and sometimes even a matter of opinion.
In this post, we've implemented a general-purpose worker thread pool based on an advanced message mechanism, which includes:
message notification: asynchronous delivery, extremely low overhead, high performance
message dispatching: acts as a load balancer, highly scalable
message buffering: acts as a message queue, for robustness
All of the above are fulfilled using basic Linux kernel features/syscalls, like epoll and eventfd.
You may refer to this approach when designing a performant single-process (especially IO-bound) background service.
To sum up, by taking advantage of Linux kernel capabilities, we have managed to implement a high-performance, message-based worker pool that handles large throughput and scales well.
How to measure our server’s performance?
In this article, we'll discuss and experiment with HTTP server benchmarking.
Let's start by recapping some related key concepts:
Connection
The number of simultaneous TCP connections, sometimes referred to as the number of users in other benchmark tools.
Latency
For an HTTP request, this is the same as the response time, measured in ms, as observed from the client side.
The latency percentiles, like p50/p90/p99, are the most common QoS metrics.
Throughput
For HTTP requests, this is also referred to as requests/second, or RPS for short. Usually, as the number of connections increases, the system throughput goes down.
So, what does load testing really mean?
In brief, it's to determine the maximum throughput (the highest RPS) under a specified number of connections, with all response times satisfying the latency target.
Thus, we can remark a server capability like this:
“Our server instance can achieve 20K RPS under 5K simultaneous connections with latency p99 at less than 200ms.”
What’s wrk2
wrk2 is an HTTP benchmarking CLI tool, considered an improvement over ab or wrk.
With wrk2, we can generate load at a constant throughput, and its latency figures are more accurate. As a command-line tool, it's quite convenient and fast.
-d: duration of the test. Note that wrk2 has a 10-second calibration period, so this should be no shorter than 20s.
-t: number of threads. Just set it to the number of CPU cores.
-R (or --rate): expected throughput. The resulting RPS (the real throughput) will be lower than this value.
-c: number of connections that will be kept open.
SUT simple implementation
All servers are simple HTTP servers that just respond with Hello, world!\n to clients.
Rust 1.28.0 (hyper 0.12)
Go 1.11.1 http module
Node.js 8.11.4
Python 3.5.2 asyncio
Testing Workflow
Our latency target: the 99th percentile is less than 200ms. That's a fairly high bar in the real world.
Due to wrk2's calibration time, every test lasts 30~60 seconds.
And since our test machine has 2 CPU threads, our command looks like this:
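The exact command isn't preserved here; a representative invocation looks like this (the wrk2 binary is typically named wrk; the host and rate are placeholders):

wrk -t2 -c1000 -d30s -R3000 --latency http://TARGET_HOST:PORT/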
We execute the command repeatedly, increasing the request rate (the -R argument) by 500 each turn until we find the maximum RPS.
Then we go on to test a larger number of connections, until the latency target is no longer satisfied or socket connection errors occur. Then we move on to the next server.
Results Analysis
Now, let's feed our output data to a matplotlib plotting script, and we finally get the whole picture below:
The plot is fairly clear. Rust beats Go even in such an IO-intensive scenario, which shows that the non-blocking sockets version of hyper really achieves something great. Node.js is indeed slower than Go, while Python's default asyncio event loop has rather poor performance.
As a spoiler: with even more connections (5K, 10K...), both Rust and Go still hold up very well without any socket connection errors, though response times get longer, and Rust still performs better; the last two may fall over.
Conclusions
In this post, we managed to benchmark the performance of our web servers using wrk2. With a finite number of experiment steps, we could determine a server's highest throughput under a certain number of connections while meeting the specified latency QoS target.
A loadable kernel module (LKM) is a mechanism for adding code to, or removing code from, the Linux kernel at run time.
Many device drivers are implemented this way; otherwise, the monolithic kernel would be far too large.
LKM communicates with user-space applications through system calls, and it can access almost all the objects/services of the kernel.
An LKM can be inserted into the monolithic kernel at any time, usually during boot or at run time.
Writing an LKM has many advantages over directly tweaking the whole kernel: since an LKM can be dynamically inserted and removed at run time, we don't need to recompile the whole kernel or reboot, and the result is easier to ship.
So, the easiest way to start kernel programming is to write a module - a piece of code that can be dynamically loaded into the kernel.
How is an LKM different from a user-space application?
An LKM runs in kernel space, which is quite a different world.
First off, the code is always asynchronous: it doesn't execute sequentially and may be interrupted at any time, so programmers must always care about concurrency and reentrancy. Unlike a user-space application, which has an entry point like main(), executes, and exits, an LKM is more like a complicated event-driven server: internally it interacts with various kernel services, and externally it provides system calls as its user-space API.
Secondly, there is only a fixed and small stack, so resource utilization and cleanup must always be carefully considered, whereas a user-space application's resource quota is fairly generous.
Thirdly, note that there is no floating-point math.
#include <linux/module.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Brooke Yang");
MODULE_DESCRIPTION("A simple Linux loadable kernel module");
MODULE_VERSION("0.1");

static char *name = "world";
module_param(name, charp, S_IRUGO);
MODULE_PARM_DESC(name, "The name to display");

static int __init hello_init(void) {
    pr_info("HELLO: Hello, %s!\n", name);
    return 0;
}

static void __exit hello_exit(void) {
    pr_info("HELLO: Bye-bye, %s!\n", name);
}

module_init(hello_init);
module_exit(hello_exit);
pr_info is a more convenient way of logging than the old-style printk:
printk(KERN_INFO ...);
Makefile
obj-m += hello.o

all:
	make -C /lib/modules/$(shell uname -r)/build/ M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build/ M=$(PWD) clean
Build && Install
Now we can make our hello module, and a hello.ko emerges successfully.
root@kali:/opt/kernel-modules/hello# make
make -C /lib/modules/4.18.0-kali2-amd64/build/ M=/opt/kernel-modules/hello modules
make[1]: Entering directory '/usr/src/linux-headers-4.18.0-kali2-amd64'
  CC [M]  /opt/kernel-modules/hello/hello.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /opt/kernel-modules/hello/hello.mod.o
  LD [M]  /opt/kernel-modules/hello/hello.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.18.0-kali2-amd64'
root@kali:/opt/kernel-modules/hello# ls -l
total 556
-rw-r--r-- 1 root root 566 Oct 23 09:41 hello.c
-rw-r--r-- 1 root root 272720 Oct 23 09:41 hello.ko
-rw-r--r-- 1 root root 872 Oct 23 09:41 hello.mod.c
-rw-r--r-- 1 root root 136376 Oct 23 09:41 hello.mod.o
-rw-r--r-- 1 root root 137864 Oct 23 09:41 hello.o
-rw-r--r-- 1 root root 154 Oct 23 09:38 Makefile
-rw-r--r-- 1 root root 42 Oct 23 09:41 modules.order
-rw-r--r-- 1 root root 0 Oct 23 09:41 Module.symvers
Display module info
root@kali:/opt/kernel-modules/hello# modinfo hello.ko
filename: /opt/kernel-modules/hello/hello.ko
version: 0.1
description: A simple Linux loadable kernel module
author: Brooke Yang
license: GPL
srcversion: 440743A20C6C4688E185D30
depends:
retpoline: Y
name: hello
vermagic: 4.18.0-kali2-amd64 SMP mod_unload modversions
parm: name:The name to display (charp)
We can watch the log with tail -f /var/log/kern.log, or just dmesg:
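The log below corresponds to an insert/remove sequence like this sketch (note the custom name parameter on the second insert):

insmod hello.ko             # -> HELLO: Hello, world!
rmmod hello                 # -> HELLO: Bye-bye, world!
insmod hello.ko name=Brooke # -> HELLO: Hello, Brooke!
rmmod hello                 # -> HELLO: Bye-bye, Brooke!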
Oct 23 09:51:37 kali kernel: [ 2651.831228] HELLO: Hello, world!
Oct 23 09:51:44 kali kernel: [ 2658.680087] HELLO: Bye-bye, world!
Oct 23 09:51:57 kali kernel: [ 2672.409216] HELLO: Hello, Brooke!
Oct 23 09:52:02 kali kernel: [ 2677.482181] HELLO: Bye-bye, Brooke!
Done!
Note: the charp parameter can even be Chinese.
Conclusions
With this article, we managed to complete our first, very simple Linux loadable kernel module (LKM).
We've got a broad view of how LKMs work: we can now configure our own kernel modules, build and insert/remove them at run time, and define and pass custom parameters to them.
On my system the kernel version is 4.18, which will be used in the following example. Of course, the workflow is just the same for other versions.
apt install linux-source-4.18
[...]
ls /usr/src
linux-config-4.18 linux-patch-4.18-rt.patch.xz linux-source-4.18.tar.xz
We now have the compressed archive of the kernel sources, and we extract the files into our working directory (no special permission is needed for compiling the kernel). In our example we use /opt/kernel; ~/kernel is also an appropriate place.
mkdir /opt/kernel; cd /opt/kernel
tar -xaf /usr/src/linux-source-4.18.tar.xz
Optionally, we may also apply the rt patch, which provides real-time OS features.
cd /opt/kernel/linux-source-4.18
xzcat /usr/src/linux-patch-4.18-rt.patch.xz | patch -p1
Configure the Kernel
When building a more recent kernel (possibly with a specific patch), the configuration should at first be kept as close as possible to that of the currently running kernel, shown by uname -r. It is sufficient to just copy the current kernel config into the source directory.
cd /opt/kernel/linux-source-4.18
cp /boot/config-4.18.0-kali2-amd64 .config
If you need to make some changes, or decide to reconfigure everything from scratch, just run make menuconfig and inspect all the details.
Note: we can tweak a lot in this phase.
Write Some Code
For a quick test (and for fun), add one line of code to the start_kernel function in init/main.c.
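For instance, a hypothetical one-liner near the top of start_kernel:

pr_info("Hello from my custom kernel!\n"); /* illustrative; any marker message works */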
Once configured, we can make the kernel. Rather than invoking make deb-pkg as the official doc suggests, we use make bindeb-pkg here, which neither generates the Debian source package nor invokes make clean.
time make -j4 bindeb-pkg LOCALVERSION=-custom KDEB_PKGVERSION=$(make kernelversion)-$(date +%Y%m%d)
After a while, we get the following packages in the parent directory:
linux-headers-4.18.10-custom_4.18.10-20181021_amd64.deb # headers
linux-image-4.18.10-custom_4.18.10-20181021_amd64.deb # kernel image
linux-image-4.18.10-custom-dbg_4.18.10-20181021_amd64.deb # kernel image with debugging symbols
linux-libc-dev_4.18.10-20181021_amd64.deb # headers of user-space library
Troubleshooting
No rule to make target 'debian/certs/test-signing-certs.pem', needed by 'certs/x509_certificate_list'. Stop
Solution: comment out or delete the corresponding config line in .config (typically CONFIG_SYSTEM_TRUSTED_KEYS).
This configure error occurs when building ffmpeg from source with the mp3 library (libmp3lame) enabled.
Solution: add -lm to --extra-libs, like this:
./configure \
    --extra-libs="-lpthread -lm" \
    ... # other configure options
Done!
HOW
Check ffbuild/config.log, which shows a dynamic linking error for libmp3lame: it requires the math library.
util.c:(.text+0x96d): undefined reference to `__exp_finite'
util.c:(.text+0xa41): undefined reference to `__pow_finite'
layer3.c:(.text+0x2250): undefined reference to `sin'
layer3.c:(.text+0x226a): undefined reference to `cos'
'use strict';
console.log('# main starting');
const auth = require('./auth');
console.log('# in main, auth is');
console.log(auth);
console.log('start running...');
auth.authenticate('Alice');
Run results:
# main starting
# in message: auth is
{}
# message loaded
# in user: message is
{ hello: [Function: hello] }
# user loaded
# in auth: user is
{ find: [Function: find] }
# auth loaded
# in main, auth is
{ authenticate: [Function: authenticate],
enabled: [Function: enabled] }
start running...
found user: Alice
hello, Alice
A failing run (the circular-dependency pitfall: message.js captured auth's exports while they were still empty, and they never got filled in):
# main starting
# in message: auth is
{}
# message loaded
# in user: message is
{ hello: [Function: hello] }
# user loaded
# in auth: user is
{ find: [Function: find] }
# auth loaded
# in main, auth is
{ authenticate: [Function: authenticate],
enabled: [Function: enabled] }
start running...
found user: Alice
/xxxx/xxxxx/xxxxx/xxxx/message.js:8
if (auth.enabled(name)) console.log('hello, ' + name);
^
TypeError: auth.enabled is not a function
And a successful run again, where the shared exports object is populated before use:
# main starting
# in message: auth is
{}
# message loaded
# in user: message is
{ hello: [Function: hello] }
# user loaded
# in auth: user is
{ find: [Function: find] }
# auth loaded
# in main, auth is
{ authenticate: [Function: authenticate],
enabled: [Function: enabled] }
start running...
found user: Alice
hello, Alice