On Linux Netlink
Authored by Kevin Jackson (Principal Software Engineer at THG)
When you are writing a Linux application that needs kernel-to-userspace or userspace-to-kernel communication, the typical answer is to use ioctl and sockets.
This is a simple mechanism for sending information down from userspace into the kernel to make requests for info, or to direct the kernel to perform an operation on behalf of the userspace application.
A good example of this type of communication between a userspace application and the kernel can be found in the venerable ethtool config application. Here the tool itself is a userspace application that communicates with the kernel via sockets. The kernel contains the API that the application uses to perform the communication.
Example: Setting a NIC’s channels
Let’s look at an example use case of ethtool with a modern multi-queue network interface (NIC). Modern NICs have the hardware to use multiple channels for sending and receiving packets. These take advantage of multi-core CPUs to balance the load of transmitting (Tx) and receiving (Rx) traffic. Historically all the traffic (and associated interrupts) was handled by a single core; spreading the workload across multiple cores can significantly improve performance.
How would we set the combined channel number on a NIC that supports the feature using ioctl?
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/sockios.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
typedef struct example_ethtools_channels example_ethtools_channels_t;
struct example_ethtools_channels
{
unsigned int num_rx_channels; /**< Number of rx channels */
unsigned int num_tx_channels; /**< Number of tx channels */
unsigned int num_other_channels; /**< Number of other channels */
unsigned int num_combined_channels; /**< Number of combined channels */
};
size_t string_strlcpy(char *dst, const char *src, size_t size)
{
size_t len;
if (size == 0)
return 0;
len = strlen(src);
/* Truncate so the terminating NUL always fits inside dst. */
if (len >= size)
len = size - 1;
memcpy(dst, src, len);
dst[len] = '\0';
return len;
}
int set_channels(const char *if_name, const unsigned int if_idx, const example_ethtools_channels_t *channels)
{
int fd = -1;
int result = 0;
struct ifreq ifr;
struct ethtool_channels ethcmd;
fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
if (fd == -1) {
return -1;
}
/* Set the interface name in this request. */
memset(&ifr, 0, sizeof(ifr));
string_strlcpy(ifr.ifr_name, if_name, sizeof(ifr.ifr_name));
/* Set the ethtool command; zero the struct first so unused fields are not garbage */
memset(&ethcmd, 0, sizeof(ethcmd));
ethcmd.cmd = ETHTOOL_SCHANNELS;
ethcmd.rx_count = channels->num_rx_channels;
ethcmd.tx_count = channels->num_tx_channels;
ethcmd.other_count = channels->num_other_channels;
ethcmd.combined_count = channels->num_combined_channels;
printf("setting rx_count[%u] tx_count[%u] other_count[%u] combined_count[%u]\n",
channels->num_rx_channels,
channels->num_tx_channels,
channels->num_other_channels,
channels->num_combined_channels);
/* Set the ifrequest data to point to the ethtool command and submit. */
ifr.ifr_data = (void *)&ethcmd;
int status = ioctl(fd, SIOCETHTOOL, &ifr);
if (status != 0) {
printf("error setting channels [%d]\n", status);
printf("errno: %d\n", errno);
result = -1;
}
close(fd);
return result;
}
int main(int argc, char **argv)
{
if (argc < 2) {
fprintf(stderr, "usage: %s <interface>\n", argv[0]);
return 1;
}
char *if_name = argv[1];
unsigned int if_idx = if_nametoindex(if_name);
example_ethtools_channels_t ch;
ch.num_combined_channels = 4;
ch.num_rx_channels = 0;
ch.num_tx_channels = 0;
ch.num_other_channels = 0;
printf("Setting new channel details...\n");
int result = set_channels(if_name, if_idx, &ch);
if(result == -1) {
printf("failed to set channels\n");
return 1;
}
return 0;
}
Here we can see the standard method of working with ioctl to communicate with the kernel and set the required channels on a NIC. The important parts of the code to take note of are:
#include <linux/ethtool.h>
fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
ETHTOOL_SCHANNELS
It’s clear from these that ethtool is using a standard socket to communicate with the kernel, and that the exact command being sent from the userspace application to the kernel is defined in the kernel headers.
It’s a very simple programming model: open a socket, fill a structure with the appropriate info and command, and send it down the socket. However, this simplicity comes at a very high cost: the application is necessarily tightly coupled to the kernel (the exact command you want to send to the kernel must already be defined in the kernel headers).
So ioctl and syscalls (the traditional methods of communication between userspace and kernel) are simple to use but have one major pain point: implementing a new protocol requires kernel-level changes. This makes any request to add these to the kernel onerous for the kernel development community, and there is no guarantee that any such additions will be completed in a timely manner (timely in the sense of a userspace application being blocked on adding functionality until the kernel has been modified).
The proposed solution to this issue has its roots in the Linux networking space: netlink.
Netlink
In contrast to the previous communications options between application and kernel, adding a new protocol with netlink requires only the addition of a constant to netlink.h; the kernel and application can then immediately communicate via a sockets-based API.
The original goal of netlink was to provide a better way of modifying network-related settings and transferring network-related information between userspace and kernel. Importantly, communication between userspace and kernel is bi-directional: the netlink socket is a duplex socket.
With this new means of communication both to and from the kernel, there is now a great way of developing applications that, by design, need frequent update events directly from the kernel. What started as a more effective means to relay and modify network-related information has become a generic kernel and userspace communications fabric via NETLINK_GENERIC.
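To make the "update events from the kernel" point concrete, here is a minimal sketch of subscribing to link state notifications. It is an illustration rather than code from this article: open_link_monitor is a hypothetical name, and it assumes the process is allowed to join the RTMGRP_LINK multicast group on a NETLINK_ROUTE socket.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: open a NETLINK_ROUTE socket subscribed to link events.
 * Returns the socket descriptor, or -1 on failure. */
int open_link_monitor(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_LINK;   /* multicast group for link changes */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Once bound, each recv() on the descriptor returns RTM_NEWLINK or RTM_DELLINK messages as interfaces change, without the application having to poll.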
The downsides
All of the advantages of netlink over syscalls or ioctl sound fantastic; however, there is a catch. The simplicity of sending and receiving a message using ioctl is gone. Netlink itself is a more complex messaging system, particularly in terms of constructing the messages themselves.
Netlink message format
A netlink message is a byte stream consisting of one or more headers plus payloads of data. Each message is aligned to a 4-byte boundary and requires padding if the payload isn’t an exact fit. A message can contain multiple headers.
The message header contains the message length, type, flags, sequence number and process id:
struct nlmsghdr {
__u32 nlmsg_len; /* Length of message including header */
__u16 nlmsg_type; /* Message content */
__u16 nlmsg_flags; /* Additional flags */
__u32 nlmsg_seq; /* Sequence number */
__u32 nlmsg_pid; /* Sending process port ID */
};
Following this header is the payload, padded where necessary so that the message ends on a 4-byte boundary.
A message can contain a second header whose layout depends on the netlink family in use; the most common families are:
- NETLINK_ROUTE for modifying routing tables, queuing, traffic classifiers etc.
- NETLINK_NETFILTER for netfilter related information
- NETLINK_KOBJECT_UEVENT for communications from kernel to userspace (for an application to subscribe to kernel events)
- NETLINK_GENERIC for users to develop application specific messages
The payload of a message follows the header and consists of data expressed in a TLV (Type, Length, Value) format (although in the case of netlink the fields are actually ordered Length, Type, Value).
A simple message will consist of a single header followed by one or more TLV formatted attributes. However many messages have an optional additional header.
All netlink messages are aligned to a 4-byte boundary (NLMSG_ALIGNTO) using the macro NLMSG_ALIGN; any unused space must be padded with 0 to complete the message.
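As a worked example of the length and padding rules, the sketch below encodes a single attribute using the kernel's own NLA_HDRLEN and NLA_ALIGN macros from linux/netlink.h. The helper name encode_attr is hypothetical and exists only for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical helper: write one attribute at buf.
 * Returns the bytes it occupies including padding, or 0 if it does not fit. */
size_t encode_attr(void *buf, size_t cap, uint16_t type,
                   const void *data, uint16_t len)
{
    struct nlattr attr;
    size_t total = NLA_ALIGN(NLA_HDRLEN + len);
    unsigned char *p = buf;

    if (len > UINT16_MAX - NLA_HDRLEN || total > cap)
        return 0;

    memset(p, 0, total);               /* zero first so the padding is 0 */
    attr.nla_len = NLA_HDRLEN + len;   /* recorded length excludes padding */
    attr.nla_type = type;
    memcpy(p, &attr, sizeof(attr));    /* Length, then Type ... */
    memcpy(p + NLA_HDRLEN, data, len); /* ... then Value */
    return total;
}
```

A 5-byte payload therefore records nla_len as 9 (4-byte header plus 5 bytes of data) but occupies 12 bytes in the message, with the last 3 bytes set to zero.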
Now we know the makeup of a netlink message, how do we put it all together and actually communicate with the kernel to achieve something useful?
Example: Setting NIC state (up or down)
A good place to start is to use netlink for its original intent: communicating with the kernel to modify the settings of a network interface.
At the simplest level, we can use NETLINK_ROUTE to switch a NIC from state UP to DOWN (or vice versa).
This follows a straightforward pattern:
- Open a socket with AF_NETLINK as the family
- Set the address id to 0 (for kernel)
- Create the standard message header
- Attach the correct payload (including the ifinfomsg and the NIC name)
- Call sendmsg
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/sockios.h>
#include <linux/if.h>
#include <linux/if_link.h>
#include <linux/rtnetlink.h>
#define ALIGNTO 4
#define ALIGN(len) (((len)+ALIGNTO-1) & ~(ALIGNTO-1))
#define ATTR_HDRLEN ALIGN(sizeof(struct nlattr))
#define SOCKET_BUFFER_SIZE (sysconf(_SC_PAGESIZE) < 8192L ? sysconf(_SC_PAGESIZE) : 8192L)
int main()
{
int nls = -1;
struct sockaddr_nl kernel_nladdr;
struct iovec io;
struct msghdr msg;
struct ifinfomsg *ifm;
unsigned int change, flags, seq;
char *ifname;
char buf[SOCKET_BUFFER_SIZE]; /* 8192 by default */
struct nlmsghdr *nlmsg;
seq = time(NULL);
/* The netlink message is destined to the kernel so nl_pid == 0. */
memset(&kernel_nladdr, 0, sizeof(kernel_nladdr));
kernel_nladdr.nl_family = AF_NETLINK;
kernel_nladdr.nl_groups = 0; /* unicast */
kernel_nladdr.nl_pid = 0;
nls = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
if (nls == -1)
{
printf("cannot open socket %s\n", strerror(errno));
return -1;
}
int br;
br = bind(nls, (struct sockaddr *) &kernel_nladdr, sizeof (kernel_nladdr));
if (br == -1)
{
printf("cannot bind to socket\n");
return -1;
}
int hlen = ALIGN(sizeof(struct nlmsghdr));
nlmsg = (struct nlmsghdr *)buf;
memset(buf, 0, hlen);
nlmsg->nlmsg_len = hlen;
nlmsg->nlmsg_type = RTM_NEWLINK;
nlmsg->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
nlmsg->nlmsg_seq = seq;
/* extra header */
char *ptr = (char *)nlmsg + nlmsg->nlmsg_len;
size_t ehlen = ALIGN(sizeof(*ifm));
nlmsg->nlmsg_len += ehlen;
memset(ptr, 0, ehlen);
/* put interface down */
change = 0;
flags = 0;
change |= IFF_UP;
flags &= ~IFF_UP; /* down = !up, obviously */
ifm = (void *)ptr;
ifm->ifi_family = AF_UNSPEC;
ifm->ifi_change = change;
ifm->ifi_flags = flags;
/* add payload details - nlattr & padding */
ifname = "eth0";
struct nlattr *attr = (void *)nlmsg + ALIGN(nlmsg->nlmsg_len);
uint16_t payload_len = ALIGN(sizeof(struct nlattr)) + strlen(ifname);
int pad;
attr->nla_type = IFLA_IFNAME;
attr->nla_len = payload_len;
memcpy((void *)attr + ATTR_HDRLEN, ifname, strlen(ifname));
pad = ALIGN(strlen(ifname)) - strlen(ifname);
if (pad > 0)
memset((void *)attr + ATTR_HDRLEN + strlen(ifname), 0, pad);
nlmsg->nlmsg_len += ALIGN(payload_len);
/* end of inner netlink nlattr details */
/* Stick the request in an io vector */
io.iov_base = (void *)nlmsg;
io.iov_len = nlmsg->nlmsg_len;
/* Wrap it in a msg */
memset(&msg, 0, sizeof(msg));
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_name = (void *)&kernel_nladdr;
msg.msg_namelen = sizeof(kernel_nladdr);
/* Send it */
int res = sendmsg(nls, &msg, 0);
printf("result of send: %d", res);
return 0;
}
With this compiled to a binary ifd, we can see that running it alters the NIC state from UP to DOWN.
[admin@dev-002 ~]$ ip a show eth0
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b8:59:9f:bf:55:7e brd ff:ff:ff:ff:ff:ff
inet xxx.xxx.xxx.xxx/24 brd xxx.xxx.xxx.xxx scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::a8c0:8adc:61a1:d4e7/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[admin@dev-002 ~]$ sudo ./ifd
result of send: 48[admin@dev-002 ~]$ ip a show eth0
4: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether b8:59:9f:bf:55:7e brd ff:ff:ff:ff:ff:ff
inet xxx.xxx.xxx.xxx/24 brd xxx.xxx.xxx.xxx scope global noprefixroute eth0
valid_lft forever preferred_lft forever
[admin@dev-002 ~]$
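The program above asks for an acknowledgement with NLM_F_ACK but never reads it, so a failure (for example, running without root) would go unnoticed. A minimal sketch of interpreting the kernel's reply follows; parse_ack is a hypothetical helper, and it assumes the reply has already been read into a buffer with recv().

```c
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical helper: interpret the first message in a receive buffer.
 * Returns 0 for a successful ACK, a negative errno value if the kernel
 * reported an error, and 1 if the buffer does not start with a complete
 * NLMSG_ERROR message. */
int parse_ack(const void *buf, size_t len)
{
    struct nlmsghdr hdr;
    struct nlmsgerr err;

    if (len < (size_t)NLMSG_HDRLEN)
        return 1;
    memcpy(&hdr, buf, sizeof(hdr));

    if (hdr.nlmsg_type != NLMSG_ERROR ||
        hdr.nlmsg_len > len ||
        hdr.nlmsg_len < NLMSG_LENGTH(sizeof(err)))
        return 1;

    /* The payload of NLMSG_ERROR is a struct nlmsgerr. */
    memcpy(&err, (const char *)buf + NLMSG_HDRLEN, sizeof(err));
    return err.error;   /* 0 means ACK, otherwise -errno */
}
```

An ACK is simply an NLMSG_ERROR message whose error field is 0; any other value is a negated errno that can be passed to strerror() after flipping the sign.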
One thing to note about this example, however, is that it uses NETLINK_ROUTE, which is one of the initial implementations of netlink. Similar to ioctl, this type of netlink socket is defined as part of the netlink headers. This makes communications with the kernel relatively simple.
However, the message construction is essentially the same regardless of the type of netlink socket in use. So the next step is moving from NETLINK_ROUTE type messages to the newer NETLINK_GENERIC type. This involves the additional complexity of setting up the connection so that the kernel registers the connection and returns a ‘family id’ prior to sending a command.
Example: Setting channels with Netlink
Unlike setting an interface up (or indeed down), changing the number of Rx/Tx/Combined/Other channels on a NIC with netlink doesn’t use a specific type of netlink message.
For changing the state of an interface we could use NETLINK_ROUTE, which is a defined type in the Linux kernel; for setting channels we need to use NETLINK_GENERIC messages. This leads to a small complication that the previous example avoids: sending an initial message to register with the kernel and receiving the correct family id from the kernel before we can send our ‘update channel numbers’ message!
With our previous netlink example code, the application only had to send a single command to the kernel; now, however, the application needs to send an initial command and interpret the response from the kernel prior to sending a subsequent command to perform the desired action.
Sending the initial message is similar to the example above. As we will be setting many nlattr values in this code, and getting the padding of the attributes correct is error-prone, we used the libmnl library as a source for some helper functions, including nl_attr_put:
/**
* nl_attr_put - add an attribute to netlink message
* @param nlh pointer to the netlink message
* @param type netlink attribute type that you want to add
* @param len netlink attribute payload length
* @param data pointer to the data that will be stored by the new attribute
*
* This function updates the length field of the Netlink message (nlmsg_len)
* by adding the size (header + payload) of the new attribute.
* From libmnl: https://www.netfilter.org/projects/libmnl/index.html
*/
void nl_attr_put(
struct nlmsghdr *nlh,
uint16_t type,
size_t len,
const void *data)
{
struct nlattr *attr = (void *)nlh + NL_ALIGN(nlh->nlmsg_len);
uint16_t payload_len = NL_ALIGN(sizeof(struct nlattr)) + len;
int pad;
attr->nla_type = type;
attr->nla_len = payload_len;
memcpy((void *)attr + NL_ATTR_HDRLEN, data, len);
pad = NL_ALIGN(len) - len;
if (pad > 0)
memset((void *)attr + NL_ATTR_HDRLEN + len, 0, pad);
nlh->nlmsg_len += NL_ALIGN(payload_len);
}
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/sockios.h>
#include <linux/if.h>
#include <linux/if_link.h>
#include <linux/rtnetlink.h>
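#define ALIGNTO 4
#define ALIGN(len) (((len)+ALIGNTO-1) & ~(ALIGNTO-1))
/* NL_ALIGN is the name the libmnl-derived helpers use for the same rounding */
#define NL_ALIGN(len) ALIGN(len)
#define ATTR_HDRLEN ALIGN(sizeof(struct nlattr))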
#define SOCKET_BUFFER_SIZE (sysconf(_SC_PAGESIZE) < 8192L ? sysconf(_SC_PAGESIZE) : 8192L)
/* these are from the ethtool src code */
#define ETHTOOL_GENL_NAME "ethtool"
#define ETHTOOL_GENL_VERSION 1
typedef struct netlink_sock netlink_sock_t;
typedef struct nl_family_info nl_family_info_t;
struct nl_family_info
{
uint16_t id;
char *name;
};
struct netlink_sock
{
struct sockaddr_nl nladdr; /**< Netlink socket address */
int fd;
unsigned int last_seq;
nl_family_info_t *fi;
};
int netlink_send_init_msg(netlink_sock_t *nls, int family, int msg_type)
{
char buf[SOCKET_BUFFER_SIZE];
/* The netlink message is destined to the kernel so nl_pid == 0. */
memset(&nls->nladdr, 0, sizeof(nls->nladdr));
nls->nladdr.nl_family = family;
nls->nladdr.nl_groups = 0; /* unicast */
nls->nladdr.nl_pid = 0;
/* send init msg saying 'hello I am ethtool' */
int ilen = NL_ALIGN(sizeof(struct nlmsghdr));
struct nlmsghdr *nlh;
nlh = (struct nlmsghdr *)buf;
memset(nlh, 0, ilen);
nlh->nlmsg_len = ilen;
nlh->nlmsg_type = msg_type;
nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
nlh->nlmsg_seq = ++nls->last_seq;
/* extra header */
char *iptr = (char *)nlh + nlh->nlmsg_len;
size_t ehilen = NL_ALIGN(sizeof(struct genlmsghdr));
nlh->nlmsg_len += ehilen;
memset(iptr, 0, ehilen);
struct genlmsghdr *genilhdr;
genilhdr = (void *)iptr;
genilhdr->cmd = CTRL_CMD_GETFAMILY;
genilhdr->version = 1;
char *genl_name;
genl_name = ETHTOOL_GENL_NAME;
nl_attr_put(nlh, CTRL_ATTR_FAMILY_NAME, strlen(genl_name)+1, genl_name);
struct iovec iio;
/* Stick the request in an io vector */
iio.iov_base = (void *)nlh;
iio.iov_len = nlh->nlmsg_len;
struct msghdr imsg;
/* Wrap it in a msg */
memset(&imsg, 0, sizeof(imsg));
imsg.msg_iov = &iio;
imsg.msg_iovlen = 1;
imsg.msg_name = (void *)&nls->nladdr;
imsg.msg_namelen = sizeof(nls->nladdr);
/* Send it */
int res = sendmsg(nls->fd, &imsg, 0);
printf("Send result: %d\n", res);
/* The function is declared int, so report the sendmsg result to the caller. */
return res;
}
The initial message is sent to the kernel (nladdr.nl_pid = 0) with the command CTRL_CMD_GETFAMILY and the name “ethtool”. With this initial message sent, the kernel will respond with a netlink message of its own, including the details of the family id it wants subsequent messages to contain.
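To make the byte layout of that initial request concrete, the sketch below builds the same kind of message using only the kernel's uapi macros (NLMSG_HDRLEN, GENL_HDRLEN, NLA_ALIGN) instead of the NL_* helpers used in this article. build_getfamily is a hypothetical name, and the sketch assumes linux/genetlink.h is available, which provides GENL_ID_CTRL, the fixed message type of the generic netlink controller.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/* Hypothetical helper: build a CTRL_CMD_GETFAMILY request in buf.
 * Returns the total message length, or 0 if it does not fit. */
size_t build_getfamily(void *buf, size_t cap, const char *family_name,
                       uint32_t seq)
{
    struct nlmsghdr nlh;
    struct genlmsghdr genl;
    struct nlattr attr;
    size_t name_len = strlen(family_name) + 1;   /* include the NUL */
    size_t attr_off = NLMSG_HDRLEN + GENL_HDRLEN;
    size_t total = attr_off + NLA_ALIGN(NLA_HDRLEN + name_len);
    unsigned char *p = buf;

    if (name_len > GENL_NAMSIZ || total > cap)
        return 0;
    memset(p, 0, total);                         /* zeroes padding too */

    memset(&nlh, 0, sizeof(nlh));
    nlh.nlmsg_len = (uint32_t)total;
    nlh.nlmsg_type = GENL_ID_CTRL;               /* talk to the controller */
    nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    nlh.nlmsg_seq = seq;
    memcpy(p, &nlh, sizeof(nlh));

    memset(&genl, 0, sizeof(genl));
    genl.cmd = CTRL_CMD_GETFAMILY;
    genl.version = 1;
    memcpy(p + NLMSG_HDRLEN, &genl, sizeof(genl));

    attr.nla_len = (uint16_t)(NLA_HDRLEN + name_len);
    attr.nla_type = CTRL_ATTR_FAMILY_NAME;
    memcpy(p + attr_off, &attr, sizeof(attr));
    memcpy(p + attr_off + NLA_HDRLEN, family_name, name_len);
    return total;
}
```

For the name "ethtool" this yields a 32-byte message: 16 bytes of nlmsghdr, 4 bytes of genlmsghdr, and a 12-byte attribute (4-byte header plus the 8-byte NUL-terminated name, which needs no padding).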
/* This is a heavily simplified version of the code we use to read the reply
* e.g. it ignores error handling
*/
#define NL_ATTR_HDRLEN NL_ALIGN(sizeof(struct nlattr))
#define NL_NLMSG_HDRLEN NL_ALIGN(sizeof(struct nlmsghdr))
enum {
CTRL_ATTR_UNSPEC,
CTRL_ATTR_FAMILY_ID,
CTRL_ATTR_FAMILY_NAME,
CTRL_ATTR_VERSION,
CTRL_ATTR_HDRSIZE,
CTRL_ATTR_MAXATTR,
CTRL_ATTR_OPS,
CTRL_ATTR_MCAST_GROUPS,
CTRL_ATTR_POLICY,
CTRL_ATTR_OP_POLICY,
CTRL_ATTR_OP,
__CTRL_ATTR_MAX,
};
#define CTRL_ATTR_MAX (__CTRL_ATTR_MAX - 1)
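char *string_strldup(const char *src, size_t size)
{
char *dst;
size_t len;
len = strlen(src);
if (len > size)
len = size;
/* calloc takes (count, size); the result is already zero-filled */
dst = calloc(1, len + 1);
if (dst == NULL)
return NULL;
memcpy(dst, src, len);
dst[len] = '\0';
return dst;
}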
/* more helper functions adapted and simplified from the code in libmnl */
void *nl_attr_get_payload(
const struct nlattr *attr)
{
return (void *)attr + NL_ATTR_HDRLEN;
}
const char *nl_attr_get_str(
const struct nlattr *attr)
{
return nl_attr_get_payload(attr);
}
uint16_t nl_attr_get_u16(
const struct nlattr *attr)
{
return *((uint16_t *)nl_attr_get_payload(attr));
}
uint16_t nl_attr_get_type(
const struct nlattr *attr)
{
return attr->nla_type & NLA_TYPE_MASK;
}
int nl_attr_type_valid(
const struct nlattr *attr,
uint16_t max
)
{
if (nl_attr_get_type(attr) > max) {
errno = EOPNOTSUPP;
return -1;
}
return 1;
}
int data_attr_cb(
const struct nlattr *attr,
void *data)
{
const struct nlattr **tb = data;
int type = nl_attr_get_type(attr);
if (nl_attr_type_valid(attr, CTRL_ATTR_MAX) < 0)
{
return NL_CB_OK;
}
tb[type] = attr;
return NL_CB_OK;
}
/* This is similar to code in ethtool source code, but has greatly reduced complexity */
int parse_gen_message(const struct nlmsghdr *nlh, void *data)
{
struct nlattr *tb[CTRL_ATTR_MAX+1] = {};
struct genlmsghdr *genl = (void *)nlh + NL_NLMSG_HDRLEN;
nl_attr_parse(nlh, sizeof(*genl), data_attr_cb, tb);
nl_family_info_t *fi = data;
if (tb[CTRL_ATTR_FAMILY_NAME]) {
const char *name = nl_attr_get_str(tb[CTRL_ATTR_FAMILY_NAME]);
fi->name = string_strldup(name, 15);
printf("name=%s\t", fi->name);
}
if (tb[CTRL_ATTR_FAMILY_ID]) {
/* nl_attr_get_u16 already returns the value, so no pointer cast is needed */
uint16_t id = nl_attr_get_u16(tb[CTRL_ATTR_FAMILY_ID]);
printf("id=%u\t",id);
fi->id = id;
}
return 0;
}
/**
* Parse a netlink message buffer.
*
* @param nls Pointer to the Netlink socket msg was received on
* @param buf Pointer to the buffer containing the message
* @param buflen Length of the provided buffer
*
* @retval Number of bytes parsed
*/
static int netlink_parse_nlmsg(netlink_sock_t *nls, char *buf, ssize_t buflen)
{
ssize_t remaining;
struct nlmsghdr *nlm;
int cnt = 0, res = 0; /* res is never assigned in this simplified version */
struct genlmsghdr *fam;
struct nl_family_info *fi;
remaining = buflen;
for (nlm = (struct nlmsghdr *)buf; NLMSG_OK(nlm, remaining); nlm = NLMSG_NEXT(nlm, remaining))
{
printf("NETLINK: rx message, nlmsg_type=%d", nlm->nlmsg_type);
if (nlm->nlmsg_type >= NLMSG_MIN_TYPE)
{
fi = calloc(1, sizeof(nl_family_info_t));
parse_gen_message(nlm, fi);
printf("NETLINK: NLMSG_MIN_TYPE for %s with family_id %u", fi->name, fi->id);
printf("NETLINK: Sending ctrl message\n");
nls->fi = fi;
send_netlink_control_msg(nls);
break;
}
else if (nlm->nlmsg_type == NLMSG_DONE)
{
printf("Parser got NLMSG_DONE, cnt=%d, remaining=%ld",
cnt, remaining);
break;
}
else if (nlm->nlmsg_type == NLMSG_ERROR)
{
printf("NETLINK: %d: parser got NLMSG_ERROR, cnt=%d, remaining=%ld, msg_type=%u",
nls->fd, cnt, remaining, nlm->nlmsg_type);
return -1;
}
if (res == 0)
cnt++;
}
return cnt;
}
/**
* Receive a message on a netlink socket. If the supplied buffer is not big enough,
* allocate a dynamic buffer and use that instead. Buffer pointers returned via variables.
*
* @param nls Pointer to the Netlink socket to read from
* @param msg Pointer to a msg header structure to read to
* @param buf Pointer to the buffer to copy data into
* @param bufsz Size of the provided buffer
* @param outbuf Pointer to a char* pointer to return the buffer used
* @param outbufsz Pointer to a size_t to store actual buffer length used
*
* @retval Number of bytes read, or -1 on error.
*/
static ssize_t netlink_recvmsg(netlink_sock_t *nls, struct msghdr *msg, char *buf,
size_t bufsz, char **outbuf, size_t *outbufsz)
{
ssize_t readlen;
/* Because we are not buffering the netlink messages, we need to be a bit smarter
* about the incoming recvbuf.
*
* Peek at the message to see how many bytes we have waiting. If we have more than
* our static buffer, use a dynamic buffer. Also use MSG_TRUNC, so we get ALL bytes. */
msg->msg_iov->iov_base = NULL;
msg->msg_iov->iov_len = 0;
readlen = recvmsg(nls->fd, msg, MSG_PEEK | MSG_TRUNC);
if (readlen <= 0)
return readlen;
/* Now choose a buffer. */
if (readlen <= bufsz)
{
msg->msg_iov->iov_base = buf;
msg->msg_iov->iov_len = bufsz;
if (outbuf != NULL)
*outbuf = buf;
if (outbufsz != NULL)
*outbufsz = bufsz;
}
else
{
char *newbuf = calloc(1, readlen);
msg->msg_iov->iov_base = newbuf;
msg->msg_iov->iov_len = readlen;
if (outbuf != NULL)
*outbuf = newbuf;
if (outbufsz != NULL)
*outbufsz = readlen;
}
/* Now do the proper recv. */
return recvmsg(nls->fd, msg, 0);
}
void read_socket(int fd, void *nls_opaque, int status)
{
netlink_sock_t *nls = nls_opaque;
struct sockaddr_nl origin_nladdr;
struct iovec iov;
struct msghdr msg;
char readbuf[SOCKET_BUFFER_SIZE], *buf_ptr = NULL;
size_t buf_sz = 0;
ssize_t readlen;
int res, fatal = 0;
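/* We need to prepare the message header to be written to.
 * Zero it first so msg_control and msg_flags do not hold stack garbage. */
memset(&msg, 0, sizeof(msg));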
msg.msg_name = &origin_nladdr;
msg.msg_namelen = sizeof(origin_nladdr);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
while (1)
{
/* Read message. Note that we pass a static 'readbuf' to the recvmsg function,
* but we also pass a pointer to our *buf_ptr.
*
* If (msg.len <= sizeof(readbuf)):
* - buf_ptr will be set to point to readbuf.
* - buf_sz will be set to sizeof(readbuf).
*
* If (msg.len > sizeof(readbuf)):
* - buf_ptr will be set to point to a dynamically allocated buffer
* - buf_sz will be set to the size of the dynamically allocated buffer.
*/
readlen = netlink_recvmsg(nls, &msg, readbuf, sizeof(readbuf), &buf_ptr, &buf_sz);
if (readlen <= 0)
{
printf("Netlink recv error: %d: %s", fd, strerror(errno));
fatal = 1;
break;
}
res = netlink_parse_nlmsg(nls, buf_ptr, readlen);
/* Free any dynamic buffer we received from recvmsg.
* elided to keep code short
if (buf_ptr != readbuf)
{
}
*/
}
}
There’s an awful lot to unpack in this code fragment. The starting point is read_socket, which listens for messages coming back from the kernel and then determines how to actually call recvmsg. This is important as we cannot be sure just how large the message will actually be, so there’s an initial static buffer that we hope can fit the message, but to be sure we check the length (using MSG_PEEK) and dynamically allocate a new buffer if required.
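The peek-then-read technique can be tried in isolation. The sketch below is an assumption-laden stand-in rather than netlink code: it uses an AF_UNIX datagram socket pair so it runs without talking to the kernel's netlink layer, and both function names are hypothetical. Linux reports the full datagram length for MSG_PEEK | MSG_TRUNC on UNIX datagram sockets just as it does on netlink sockets.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: size of the next queued datagram on fd, without
 * consuming it. Returns -1 on error. */
ssize_t peek_datagram_len(int fd)
{
    char scratch;

    /* MSG_PEEK leaves the datagram queued; MSG_TRUNC makes recv() return
     * the real length even though the buffer is only one byte. */
    return recv(fd, &scratch, sizeof(scratch), MSG_PEEK | MSG_TRUNC);
}

/* Demonstration: send n zero bytes over a socket pair and report what the
 * peek says is waiting. Returns -1 on any failure. */
ssize_t demo_peek(size_t n)
{
    int sv[2];
    char payload[256] = {0};
    ssize_t peeked = -1;

    if (n > sizeof(payload) ||
        socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    if (send(sv[0], payload, n, 0) == (ssize_t)n)
        peeked = peek_datagram_len(sv[1]);

    close(sv[0]);
    close(sv[1]);
    return peeked;
}
```

The caller can then compare the peeked length against its static buffer and only allocate when the message is larger, which is the decision netlink_recvmsg makes above.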
Next we actually parse the message in netlink_parse_nlmsg. This function works through the (potentially multiple) nlmsghdr headers, handling each type of message in turn. For our example of listening for a response from the kernel here, we only really care about message types at or above NLMSG_MIN_TYPE, which are family-level messages (here a NETLINK_GENERIC reply) rather than netlink control messages.
Control passes through a series of helper functions to extract the information needed (again, in our simplified use case we only care about the family id; however, there is much more that can be passed back in this message).
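The listing calls nl_attr_parse without showing it. As a rough idea of what such an attribute walk involves, here is a minimal sketch that searches a run of attributes for one type. find_attr is a hypothetical helper and is not libmnl's API; it relies only on the NLA_* macros from linux/netlink.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical helper: scan a run of attributes for `type`.
 * Returns a pointer to the start of the matching attribute, or NULL. */
const void *find_attr(const void *attrs, size_t len, uint16_t type)
{
    const unsigned char *p = attrs;

    while (len >= (size_t)NLA_HDRLEN) {
        struct nlattr a;
        size_t step;

        memcpy(&a, p, sizeof(a));
        if (a.nla_len < NLA_HDRLEN || a.nla_len > len)
            break;                        /* malformed or truncated */
        if ((a.nla_type & NLA_TYPE_MASK) == type)
            return p;

        step = NLA_ALIGN(a.nla_len);      /* skip payload and padding */
        if (step > len)
            break;
        p += step;
        len -= step;
    }
    return NULL;
}
```

In a generic netlink reply the attributes begin after both headers, so the starting pointer would be the message address plus NLMSG_HDRLEN + GENL_HDRLEN, and the payload of a match begins NLA_HDRLEN bytes after the returned pointer.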
Finally we store the id retrieved from parsing the message in our netlink_sock struct so we can access it later as we send our control message:
#define ETHTOOL_MSG_CHANNELS_SET 18
int send_netlink_control_msg(netlink_sock_t *nls)
{
struct sockaddr_nl kernel_nladdr;
struct iovec io;
struct msghdr msg;
struct genlmsghdr *gennlhdr;
char *ifname;
char buf[SOCKET_BUFFER_SIZE];
struct nlmsghdr *nlmsg;
/* The netlink message is destined to the kernel so nl_pid == 0. */
memset(&kernel_nladdr, 0, sizeof(kernel_nladdr));
kernel_nladdr.nl_family = AF_NETLINK;
kernel_nladdr.nl_groups = 0; /* unicast */
kernel_nladdr.nl_pid = 0;
/* put header */
int hlen = NL_ALIGN(sizeof(struct nlmsghdr));
nlmsg = (struct nlmsghdr *)buf;
memset(buf, 0, hlen);
nlmsg->nlmsg_len = hlen;
/* where msg_type is the family id the kernel provided */
nlmsg->nlmsg_type = nls->fi->id;
nlmsg->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
nlmsg->nlmsg_seq = ++nls->last_seq;
/* extra header */
char *ptr = (char *)nlmsg + nlmsg->nlmsg_len;
size_t ehlen = NL_ALIGN(sizeof(struct genlmsghdr));
nlmsg->nlmsg_len += ehlen;
memset(ptr, 0, ehlen);
gennlhdr = (void *)ptr;
gennlhdr->cmd = ETHTOOL_MSG_CHANNELS_SET; /* the actual command */
gennlhdr->version = ETHTOOL_GENL_VERSION;
/* add fill header/nesting */
ifname = "eth0";
struct nlattr *nest;
nest = (void *)nlmsg + NL_ALIGN(nlmsg->nlmsg_len);
nest->nla_type = NLA_F_NESTED | ETHTOOL_A_CHANNELS_HEADER;
nlmsg->nlmsg_len += NL_ALIGN(sizeof(struct nlattr));
nl_attr_put(nlmsg, ETHTOOL_A_HEADER_DEV_NAME, strlen(ifname)+1, ifname);
nest->nla_len = ((void *)nlmsg + NL_ALIGN(nlmsg->nlmsg_len)) - (void *)nest;
uint32_t num_channels;
num_channels = 8; /* hardcoded for example */
nl_attr_put(nlmsg, ETHTOOL_A_CHANNELS_COMBINED_COUNT, sizeof(uint32_t), &num_channels);
/* Stick the request in an io vector */
io.iov_base = (void *)nlmsg;
io.iov_len = nlmsg->nlmsg_len;
/* Wrap it in a msg */
memset(&msg, 0, sizeof(msg));
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_name = (void *)&kernel_nladdr;
msg.msg_namelen = sizeof(kernel_nladdr);
/* Send it */
int res = sendmsg(nls->fd, &msg, 0);
printf("result of send: %d errno=%d : %s", res, errno, strerror(errno));
return 0;
}
Like the first message, we create a standard header nlmsghdr and an additional genlmsghdr (as this is a NETLINK_GENERIC message). In nlmsghdr we actually set the family id we retrieved by parsing the kernel response:
/* where msg_type is the family id the kernel provided */
nlmsg->nlmsg_type = nls->fi->id;
In the genlmsghdr we set the actual command we need to execute (in our case ETHTOOL_MSG_CHANNELS_SET). To set the interface name "eth0" we need to create a ‘nested attribute’; the lines following the "add fill header/nesting" comment handle this nesting. Finally we set the number of channels we wish to set (8) and the type (combined) in a further nlattr, then we construct the message, as before, and send it.
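The nesting step is easier to follow when isolated into a pair of helpers: one opens the nest and remembers where its header is, the other patches the header's length once the inner attributes have been added. This is a sketch under assumptions: nest_start, nest_end and put_attr are hypothetical names (put_attr mirrors the nl_attr_put helper shown earlier so the block stands alone), and the caller is assumed to have provided a large enough buffer.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Append one attribute to the message; the caller guarantees there is room. */
void put_attr(struct nlmsghdr *nlh, uint16_t type, const void *data,
              uint16_t len)
{
    size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
    unsigned char *p = (unsigned char *)nlh + off;
    struct nlattr a;

    a.nla_len = (uint16_t)(NLA_HDRLEN + len);
    a.nla_type = type;
    memset(p, 0, NLA_ALIGN(a.nla_len));      /* zero the padding */
    memcpy(p, &a, sizeof(a));
    memcpy(p + NLA_HDRLEN, data, len);
    nlh->nlmsg_len = (uint32_t)(off + NLA_ALIGN(a.nla_len));
}

/* Open a nest: write a header flagged NLA_F_NESTED whose length is patched
 * later. Returns the header's offset from the start of the message. */
size_t nest_start(struct nlmsghdr *nlh, uint16_t type)
{
    size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
    struct nlattr a;

    a.nla_len = NLA_HDRLEN;
    a.nla_type = (uint16_t)(NLA_F_NESTED | type);
    memcpy((unsigned char *)nlh + off, &a, sizeof(a));
    nlh->nlmsg_len = (uint32_t)(off + NLA_HDRLEN);
    return off;
}

/* Close a nest: its length covers everything appended since nest_start. */
void nest_end(struct nlmsghdr *nlh, size_t off)
{
    struct nlattr a;

    memcpy(&a, (unsigned char *)nlh + off, sizeof(a));
    a.nla_len = (uint16_t)(NLMSG_ALIGN(nlh->nlmsg_len) - off);
    memcpy((unsigned char *)nlh + off, &a, sizeof(a));
}
```

With these, the device-name nest from the listing would read as nest_start(nlmsg, ETHTOOL_A_CHANNELS_HEADER), put_attr(nlmsg, ETHTOOL_A_HEADER_DEV_NAME, ifname, strlen(ifname) + 1), then nest_end, after which the combined count attribute is added outside the nest.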
Conclusion
It’s fairly easy to see how much more complex the netlink version of the code is in comparison to the legacy ioctl method presented at first. This stems from the fact that the NETLINK_GENERIC type is used for many different applications, and as such there is an initial handshake with the kernel before you can send the control message. In isolation, sending the control message to set the number of combined channels is not really that much more complex (although the nested attributes are easy to miss when first working this out).
As working with netlink directly is quite laborious, a couple of libraries have sprung up to make life easier for netlink application developers: libnl and libmnl.
There are pros and cons to using either of these (or using neither library and working directly at a low-level). Libnl is the standard netlink application library and as such has good coverage in terms of examples and documentation, along with separate modules for generic, route and nf (netfilter). The big drawback of libnl is the size of the library which is something to be aware of.
Libmnl addresses this size issue at the cost of providing far less. Libmnl is “minimal” and focuses on things that are easy to get wrong: parsing messages, handling nlattr etc.
For our experiments, we decided against the fully-featured libnl as it’s another large dependency to keep up to date & audit with respect to CVEs etc. Libmnl is closer to what we need, however as we don’t need all of it, we simply took the parts we found most useful and included them directly in our sources.
The other source of code that was useful when learning how to use generic netlink was the ethtool source code itself. Interestingly, ethtool relies on libmnl to deal with attr handling and message construction, which provides a set of example code for how libmnl can be used to build applications.