Linux kernel & driver exploration

Tue Mar 22 2022

Authored by Kevin Jackson (Principal Software Engineer at THG)

Recently at THG, we have been working closely with Linux internals to develop networking software required as part of a security research project: Soteria. The research is split into multiple strands. Our section of the work involves low-level programming in C with the Linux kernel and should eventually lead to some FPGA development in the latter part of the programme.

During this work we’ve had to dig through the details of how the kernel calls specific driver code and how our user-space application interacts with both the kernel and the drivers.

post 5 image 1

Mellanox OFED driver stack (not 100% identical to open source version)

As the research is focused on specific hardware capabilities, we have configured a network lab environment with appropriate network interfaces: Mellanox ConnectX-5 SmartNICs. These cards support the hardware capabilities we need to test and, more importantly, the driver source code is shipped directly in-kernel.

Kernel Modules & simple debugging

Modern Linux kernels use a ‘kernel module’ system to allow drivers to be dynamically loaded and unloaded. This system is fairly well documented and has simple tooling to interact with, and this is where our investigation of the drivers we were interested in started.

As our work was with the network devices, we needed to discover which network interfaces were available: ip a to show all interfaces, then ethtool -i <interface> to display some useful information, including the driver in use (we’ll return to ethtool later).

To remove a loaded module: sudo rmmod <modname>. To load a module: sudo insmod <path to .ko file> (insmod takes the path to the module file; sudo modprobe <modname> loads an installed module by name). When working with drivers it’s sometimes useful to customise the code to add more debugging messages using printk() calls. These go to the kernel log, which can be read with dmesg and which many distributions also write to /var/log/messages. This simple debugging is facilitated by unloading and reloading the kernel module.
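After an unload / reload cycle it can be handy to confirm from code that the module really is present. The sketch below scans a /proc/modules-style listing for a module name. The function names are our own invention for illustration, and it assumes each line begins with the module name followed by a space, which is how /proc/modules is laid out.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `modname` appears as a module name in a /proc/modules-style
 * listing, 0 if it does not. Each line is expected to start with the module
 * name followed by a space, e.g. "mlx5_core 1458176 1 mlx5_ib, Live 0x0".
 */
int module_in_listing(FILE *listing, const char *modname)
{
    char line[1024];
    size_t name_len = strlen(modname);

    if (name_len == 0)
        return 0;

    while (fgets(line, sizeof(line), listing) != NULL) {
        /* Match the whole first token, not just a prefix of it. */
        if (strncmp(line, modname, name_len) == 0 && line[name_len] == ' ')
            return 1;
    }
    return 0;
}

/* Convenience wrapper that reads the live list; returns -1 if unreadable. */
int module_is_loaded(const char *modname)
{
    FILE *f = fopen("/proc/modules", "r");
    int found;

    if (f == NULL)
        return -1;
    found = module_in_listing(f, modname);
    fclose(f);
    return found;
}
```

Calling module_is_loaded("mlx5_core") before and after rmmod / insmod gives a quick sanity check that the reload actually happened. Note that /proc/modules lists names with underscores.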

This simple printf + binary chop is a standard debugging tool that almost every developer will reach for at some point (regardless of language/framework etc.) as it is very simple and can quickly get to the source of errors without any advanced tooling or learning a specific debugger. It’s not very sophisticated but it does the job, up to a point.

Manually tracing codepaths

Another exploration technique is to simply open up the source code and follow the various code being executed, jumping from function call to function call and making note of how the code is composed.

Drawing out the path a single call follows through the code is useful for gaining an understanding of the complexity of the code that you’re investigating along with finding the source of an error.

From this process a simple call-graph of the function calls can be created.

ioctl & Ethtool

The code that we needed to work with is ethtool, a user-space tool that is shipped with Linux and is used to configure network interface cards. To perform a configuration operation, a specific command is passed to ioctl via a socket, and this command is then passed from the kernel to the device driver.

post 5 image 2

Just some of the supported ethtool commands…

Combining an ETHTOOL_* command with ioctl is relatively straightforward. To begin with we simply create a socket: socket(AF_INET, SOCK_DGRAM, IPPROTO_IP). Then we set the values of the ifreq request, pointing its data field at an ethtool_channels structure whose cmd is set to ETHTOOL_GCHANNELS. Finally we call ioctl with the ifreq to get the channels information:

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/sockios.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

typedef struct example_ethtools_channels example_ethtools_channels_t;

struct example_ethtools_channels {
    unsigned int num_rx_channels; /**< Number of rx channels */
    unsigned int num_tx_channels; /**< Number of tx channels */
    unsigned int num_other_channels; /**< Number of other channels */
    unsigned int num_combined_channels; /**< Number of combined channels */
};

/* Copy src into dst, truncating if needed; dst is always NUL-terminated. */
size_t string_strlcpy(char *dst, const char *src, size_t size) {
    size_t len;

    if (size == 0)
        return 0;
    len = strlen(src);
    if (len >= size)
        len = size - 1; /* leave room for the terminator */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}

int example_ethtool_get_channels(const char *if_name, const unsigned int if_idx, example_ethtools_channels_t *channels) {
    int fd = -1;
    struct ifreq ifr;
    struct ethtool_channels ethchannels;

    channels->num_rx_channels = 0;
    channels->num_tx_channels = 0;
    channels->num_other_channels = 0;
    channels->num_combined_channels = 0;

    fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
    if (fd == -1) {
        perror("cannot get socket");
        return -1;
    }

    /* Set the interface name in this request. */
    memset(&ifr, 0, sizeof(struct ifreq));
    string_strlcpy(ifr.ifr_name, if_name, sizeof(ifr.ifr_name));

    /* Set interface index for the request. Note that ifr_ifindex shares a
     * union with ifr_data, so this value is overwritten below; the kernel
     * looks the device up by ifr_name. */
    ifr.ifr_ifindex = if_idx;

    /* Set the ethtool command */
    memset(&ethchannels, 0, sizeof(struct ethtool_channels));
    ethchannels.cmd = ETHTOOL_GCHANNELS;

    /* Set the ifrequest data to point to the ethtool command and submit. */
    ifr.ifr_data = (void *)&ethchannels;

    if (ioctl(fd, SIOCETHTOOL, &ifr) != 0) {
        perror("SIOCETHTOOL failed");
        close(fd);
        return -1;
    }

    printf("max_rx[%u] max_tx[%u] max_other[%u] max_combined[%u]\n", ethchannels.max_rx, ethchannels.max_tx, ethchannels.max_other, ethchannels.max_combined);
    channels->num_rx_channels = ethchannels.rx_count;
    channels->num_tx_channels = ethchannels.tx_count;
    channels->num_other_channels = ethchannels.other_count;
    channels->num_combined_channels = ethchannels.combined_count;

    close(fd);
    return 0;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <interface>\n", argv[0]);
        return 1;
    }

    printf("Channels info for: %s\n", argv[1]);
    char *if_name = argv[1];
    unsigned int if_idx = if_nametoindex(if_name);
    example_ethtools_channels_t ch;

    return example_ethtool_get_channels(if_name, if_idx, &ch) == 0 ? 0 : 1;
}
Fetching channel information from a NIC using ETHTOOL_GCHANNELS

This example shows how to work with the ETHTOOL_* commands. However, our application needed to steer traffic to a specific channel, and doing that involved diving into the device driver itself.

Ethtool uses driver-supplied functions via a function table, ethtool_ops.

post 5 image 3

A portion of the Mellanox driver’s ethtool_ops function mapping table

These functions are called from ioctl.

post 5 image 4

Here you can see the call from ioctl to use the function set_channels that is provided by the device in the ethtool_ops table. This eventually calls mlx5e_set_channels.
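To make the function-table idea concrete, here is a small user-space sketch of the pattern: a "driver" fills in a table of function pointers, and generic "core" code dispatches through it, returning -EOPNOTSUPP when an operation is missing. All of the demo_* names and types are hypothetical stand-ins, not the kernel's real ethtool_ops definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct demo_channels {
    unsigned int combined_count;
};

struct demo_dev;

struct demo_ethtool_ops {
    int (*get_channels)(struct demo_dev *dev, struct demo_channels *ch);
    int (*set_channels)(struct demo_dev *dev, const struct demo_channels *ch);
};

struct demo_dev {
    const struct demo_ethtool_ops *ops; /* supplied by the "driver" */
    unsigned int combined_count;
};

/* "Driver" side: the implementation behind the table entry. */
static int demo_drv_set_channels(struct demo_dev *dev, const struct demo_channels *ch)
{
    if (ch->combined_count == 0)
        return -EINVAL;
    dev->combined_count = ch->combined_count;
    return 0;
}

/* The driver only fills in what it supports; get_channels stays NULL. */
static const struct demo_ethtool_ops demo_drv_ops = {
    .set_channels = demo_drv_set_channels,
};

/* "Core" side: generic code that knows nothing about the driver. */
int demo_core_set_channels(struct demo_dev *dev, const struct demo_channels *ch)
{
    if (dev->ops == NULL || dev->ops->set_channels == NULL)
        return -EOPNOTSUPP;
    return dev->ops->set_channels(dev, ch);
}

int demo_core_get_channels(struct demo_dev *dev, struct demo_channels *ch)
{
    if (dev->ops == NULL || dev->ops->get_channels == NULL)
        return -EOPNOTSUPP;
    return dev->ops->get_channels(dev, ch);
}
```

The useful property of this layout is that the generic code never needs to know which driver it is talking to, which is also why following a call by hand means finding the table first and the implementation second.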

Digging deeper

One of the features of modern Linux we are most interested in is eXpress Data Path (XDP), an alternative to Data Plane Development Kit (DPDK). While developing our application we noticed something significant when trying to use XDP in zero-copy mode: the channels that the traffic should have been available on were not receiving packets.

After checking in with the Mellanox devs and reading around the drivers and the driver source code, we came to understand that the Mellanox driver applies an offset to the channel id or number when XDP is enabled. So if a NIC is configured with 4 channels (0–3 usually) with XDP enabled, the actual channels would be 4–7 instead.
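The arithmetic is simple enough to capture in a couple of helpers. This is a minimal sketch that assumes the offset is exactly the number of configured channels, as in the 4-channel example above. The helper names are ours, and the rule may not hold for other drivers or driver versions.

```c
#include <stdbool.h>

/*
 * Map a regular channel index to the queue id used when XDP zero-copy is
 * enabled, assuming the driver offsets ids by the configured channel count.
 * Returns -1 if the index is out of range.
 */
int xsk_queue_for_channel(unsigned int channel_ix, unsigned int num_channels)
{
    if (num_channels == 0 || channel_ix >= num_channels)
        return -1;
    return (int)(num_channels + channel_ix);
}

/* True if queue_id falls in the offset range [num_channels, 2 * num_channels). */
bool is_xsk_queue(unsigned int queue_id, unsigned int num_channels)
{
    return queue_id >= num_channels && queue_id < 2 * num_channels;
}
```

With 4 channels, xsk_queue_for_channel(0, 4) gives 4 and xsk_queue_for_channel(3, 4) gives 7, matching the 4–7 range described above.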

As the application we were working on required XDP, we needed to understand the mechanisms in the Mellanox driver around RSS flow hashing which doesn’t automatically shift from the normal queues / channels to the offset queues / channels when XDP is enabled.

We want to programmatically apply the same offset so that we can perform the equivalent of steering the traffic to the correct queue as documented in the Linux kernel documentation:

sudo ethtool -N <interface> rx-flow-hash udp4 fn
sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port 4242 action 2
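As a sketch of what the second command might look like from C, the function below fills in an ethtool_rxnfc request for ETHTOOL_SRXCLSRLINS (insert an RX classification rule). The function name is ours. We are assuming that set bits in the mask mean "this bit must match", which is how we read the kernel header, and that the driver accepts RX_CLS_LOC_ANY to pick a free rule slot. Both are worth verifying against the driver in use.

```c
#include <arpa/inet.h>
#include <linux/ethtool.h>
#include <string.h>

/*
 * Build (but do not submit) a rule that steers UDP/IPv4 traffic with the
 * given source and destination ports to RX queue `queue`.
 */
void example_build_udp4_rule(struct ethtool_rxnfc *nfc, unsigned short src_port,
                             unsigned short dst_port, unsigned int queue)
{
    memset(nfc, 0, sizeof(*nfc));
    nfc->cmd = ETHTOOL_SRXCLSRLINS;
    nfc->fs.flow_type = UDP_V4_FLOW;

    /* Values to match, in network byte order. */
    nfc->fs.h_u.udp_ip4_spec.psrc = htons(src_port);
    nfc->fs.h_u.udp_ip4_spec.pdst = htons(dst_port);

    /* Assumption: all-ones mask means every bit of the port is compared. */
    nfc->fs.m_u.udp_ip4_spec.psrc = 0xffff;
    nfc->fs.m_u.udp_ip4_spec.pdst = 0xffff;

    /* The "action": the RX queue that matching packets are delivered to. */
    nfc->fs.ring_cookie = queue;

    /* Assumption: let the driver choose where to store the rule. */
    nfc->fs.location = RX_CLS_LOC_ANY;
}
```

To submit it, point ifr.ifr_data at the filled-in structure and call ioctl(fd, SIOCETHTOOL, &ifr), exactly as in the channels example. The queue argument is the place where a driver-specific offset would be applied before the rule is built.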

With an understanding of how to use ethtool & ioctl, we now needed to work out how the Mellanox driver was being used after the command to set channels is sent.

Starting with the initial ETHTOOL_SCHANNELS (set channels) command being issued, and where it is used inside ethtool, we hand-traced through the driver code, following each function call to work out which code paths were executed and how this impacted the channel ids and packet flow.

post 5 image 5

Manually traced call-graph from set_channels command

Now that we have manually inspected the code to determine the codepath, can we also use the tracing and debugging frameworks built into the Linux kernel to come up with a similar path and verify this manual detective work?


As tracing strategies have evolved in the industry (DTrace etc.), there have been a few attempts to add good tracing capabilities to the Linux kernel. The current implementation is based on a macro, TRACE_EVENT, which enables developers to add a tracepoint without coupling the tracepoint to any particular implementation of a specific tracer.

Using the TRACE_EVENT macro will obviously require changes to the code, both to add a new TRACE_EVENT and to add a call to the newly added tracepoint.

Given our interest in the function mlx5e_open_xsk we will need to add code to mlx5e_open_channel as the callsite:

post 5 image 6

The obvious location for the call to trace_mlx5e_open_xsk would be immediately after the call to mlx5e_open_xsk in the conditional block. However, to ensure that trace_mlx5e_open_xsk is actually called regardless of the state of XDP, I placed the call to the trace function outside of any XDP related checks.

The actual TRACE_EVENT is defined in two files, xsk_tracepoint.h and xsk_tracepoint.c:

#define TRACE_SYSTEM mlx5

#if !defined(_MLX5_XSK_TP_) || defined(TRACE_HEADER_MULTI_READ)
#define _MLX5_XSK_TP_

#include <linux/tracepoint.h>
#include "../en.h"

TRACE_EVENT(mlx5e_open_xsk,
    TP_PROTO(struct mlx5e_channel *ch),
    TP_ARGS(ch),
    TP_STRUCT__entry(
        __field(struct mlx5e_channel *, ch)
        __field(int, ix)
        __field(u8, num_tc)
    ),
    TP_fast_assign(
        __entry->ch = ch;
        __entry->ix = ch->ix;
        __entry->num_tc = ch->num_tc;
    ),
    TP_printk("ix=%d \n", __entry->ix)
);

#endif /* _MLX5_XSK_TP_ */

/* This part must be outside protection */
#define TRACE_INCLUDE_PATH ./diag
#define TRACE_INCLUDE_FILE xsk_tracepoint
#include <trace/define_trace.h>
Adding a TRACE_EVENT for mlx5e driver


#define CREATE_TRACE_POINTS
#include "xsk_tracepoint.h"



With the new trace code added to my copy of the driver source, I need to compile a new kernel, install the new driver module, set grub to boot into the latest kernel by default (an important step) and then reboot:

  • sudo make -j4
  • sudo make modules_install && sudo make install
  • sudo grub2-set-default 0
  • sudo grub2-mkconfig -o /boot/grub2/grub.cfg
  • sudo shutdown -r now

With that compiling housekeeping out of the way, we can log into our newly “customised kernel” and see if the TRACE_EVENT has been added to the list of available events.

To actually enable tracepoints we need to add them to the tracing path:

echo *:* > /sys/kernel/debug/tracing/set_event

This command enables all events to be traced, which fills the trace completely with information that we’re not actually interested in. A more focused version is:

echo mlx5:* > /sys/kernel/debug/tracing/set_event
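If the application itself needs to switch events on, the same write can be done from C. A minimal sketch follows. The function name is ours, and the path is passed in rather than hard-coded so the code can be tried against an ordinary file first. Writing to the real set_event file needs root.

```c
#include <stdio.h>

/*
 * Write an event filter such as "mlx5:*" to a tracing set_event file,
 * replacing whatever was there (the equivalent of `echo filter > path`).
 * Returns 0 on success, -1 on any failure.
 */
int example_set_trace_events(const char *set_event_path, const char *filter)
{
    FILE *f = fopen(set_event_path, "w");
    int ok;

    if (f == NULL)
        return -1;

    ok = fputs(filter, f) >= 0 && fputc('\n', f) != EOF;

    /* Buffered write errors only surface at close, so check that too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

For example, example_set_trace_events("/sys/kernel/debug/tracing/set_event", "mlx5:*") mirrors the shell command above.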

We can now use perf list to show all the available tracepoints — hardware and software defined, including the newly added mlx5e_open_xsk.

post 5 image 7

List of available mlx5 tracepoints including new mlx5e_open_xsk

Now that the new tracepoint is available it’s time to use perf to record samples while interacting with ethtool to cause the traced code to execute.

To record the samples with perf we need to run the following:

perf record -e 'mlx5:*' -a -g --call-graph dwarf

While that process is running we then interact with ethtool to set channels, e.g.:

ethtool -L <dev name> combined 4

The recorded data is saved in perf.data (perf’s default output file) and can be viewed and inspected with perf report.

post 5 image 8

Using perf report on the recorded data

The report generated by perf shows that the manual trace is correct and we can start to use this understanding of the behaviour of the driver to build our application calling the same (or similar) kernel functions and APIs.

The perf tool gives a good understanding of the callgraph and which parts of this are using the most time — it is designed for understanding and debugging performance issues after all! Another place to look for more information from the TRACE_EVENT macro is in /sys/kernel/debug/tracing/trace which will contain the output of the TP_printk.

post 5 image 9

The contents of trace after an ethtool setchannels call

In this output we can see the ix values printed from TP_printk which show that in this instance the queues are configured 0–7 which in the Mellanox drivers are the default queue numbers, not the XDP zero-copy queue numbers (which would be 8–15).
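Checking these values by eye gets tedious, so a small parser can pull the ix value out of each trace line. This is only a sketch: the function name is ours, and it assumes the event name mlx5e_open_xsk and the "ix=%d" format from the TP_printk above appear on the same line.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the channel index from one line of trace output produced by the
 * mlx5e_open_xsk tracepoint. Returns 0 and stores the value in *ix on
 * success, -1 if the line is not from that event or has no ix field.
 */
int example_parse_open_xsk_ix(const char *trace_line, int *ix)
{
    const char *event = strstr(trace_line, "mlx5e_open_xsk:");
    const char *field;

    if (event == NULL)
        return -1;

    /* Only look for the field after the event name. */
    field = strstr(event, "ix=");
    if (field == NULL || sscanf(field, "ix=%d", ix) != 1)
        return -1;

    return 0;
}
```

Feeding each line of /sys/kernel/debug/tracing/trace through this function gives the list of channel indexes that were opened, which can then be compared against the range we expect for regular or XDP zero-copy queues.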

With proof that the TRACE_EVENT macro was working, we could see that the channels with the XSK queues were being initialised at the point where we expected.


The areas of the code we’ve been looking at are under active development by the folks at Mellanox, and as the Linux kernel advances, some of the tracepoints we had added were no longer useful: the code they were tracing had been refactored so much that it no longer made sense to trace that particular code path.

As we continue our research, we will continue using perf and TRACE_EVENT code in our own branches of the code and work to add sensible tracepoints for diagnostics under drivers/net/ethernet/mellanox/mlx5/core/diag.