The superpowers of eBPF in the networking stack

eBPF is the extended Berkeley Packet Filter, a simple virtual machine that has been added to the Linux kernel. This virtual machine implements a simple RISC-like assembly language that enables programmers to inject new code inside the kernel without changing the kernel itself. This is much more powerful than the existing kernel modules or the Linux kernel probes. The use cases for eBPF evolved slowly after the first eBPF virtual machine was included in the Linux kernel, and new ones have emerged over the years. It is now possible to use eBPF to collect performance statistics or to configure sandboxes that control the use of system calls, among other things. eBPF can also play a very interesting role in the networking stack. As an illustration, this post summarizes four recent articles written by researchers of the IP Networking Lab at UCLouvain.

eBPF makes it possible to efficiently instrument the TCP stack

One of the most frequent uses of eBPF is to monitor various performance parameters in the Linux kernel. The bcc software includes dozens of examples of monitoring applications that shed light on various aspects of a running system. A nice feature of eBPF is that it enables programmers to attach eBPF bytecode at specific locations inside the kernel, which makes it possible to collect new statistics almost anywhere in the Linux kernel. In COP2: Continuously Observing Protocol Performance, Olivier Tilmans describes how eBPF can be used to accurately monitor the TCP and Multipath TCP implementations in the Linux kernel. The paper provides many details about the new metrics that can be collected and some results from an initial deployment.

By injecting eBPF probes at well-chosen locations in the TCP stack and aggregating their measurements, it is possible to track the evolution of all connections. More specifically, it is possible to track the evolution of any (custom) state variable and to report selected changes to user space. As an example, the figure below shows the probes that report the occurrence of retransmission timeouts to user space as an aggregated statistic.
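As a rough user-space illustration of this aggregation idea (it is not the code of the probes described in the paper), the sketch below counts retransmission timeouts per connection and exports one summary record per connection instead of one message per event. The `FlowKey` tuple and the `RtoAggregator` class are hypothetical names introduced for this example. In a real deployment, the counter would be updated by eBPF bytecode running in the kernel and stored in an eBPF map.

```python
from collections import Counter
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical connection identifier (the classical 4-tuple)."""
    saddr: str
    sport: int
    daddr: str
    dport: int


class RtoAggregator:
    """User-space model of a probe that counts retransmission timeouts.

    In the kernel, the counters would live in an eBPF map and be updated
    by a probe attached to the TCP retransmission-timer code path.
    """

    def __init__(self) -> None:
        self.rto_count: Counter = Counter()

    def on_rto(self, flow: FlowKey) -> None:
        # Called once per retransmission timeout observed for `flow`.
        self.rto_count[flow] += 1

    def report(self) -> dict:
        # Export one aggregated record per connection, then reset,
        # instead of sending one message per event to user space.
        summary = dict(self.rto_count)
        self.rto_count.clear()
        return summary


agg = RtoAggregator()
flow_a = FlowKey("10.0.0.1", 40000, "10.0.0.2", 443)
flow_b = FlowKey("10.0.0.1", 40001, "10.0.0.3", 80)
for flow in (flow_a, flow_a, flow_b):
    agg.on_rto(flow)
stats = agg.report()
print(stats[flow_a], stats[flow_b])
```

Keeping the per-event work to a single counter update is what keeps this kind of instrumentation cheap: the expensive step, exporting data, happens once per reporting interval rather than once per event.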

Some of the collected performance metrics are described in the table below. They go beyond the classical TCP metrics that are exposed by tools like netstat, ss or the SNMP MIB.

A key benefit of using eBPF to instrument the TCP stack is that it incurs a very small performance overhead. The figure below shows the overhead of the instrumented TCP stack (labelled Flowcorder) compared with the standard stack and with a naive implementation.

The paper describes in more detail the operation of these eBPF probes, their implementation and some use cases in a university network where the probes exported the collected statistics to a collector using IPFIX. The source code for all the probes, as well as RPM packages to run them easily, is available from https://github.com/oliviertilmans/flowcorder

eBPF even makes it possible to extend the TCP stack

TCP was designed to be extensible. A client can propose to use an extension over a given TCP connection by sending an option that identifies this extension during the three-way handshake, while a few other options, such as the User Timeout option of RFC 5482, can be sent directly without negotiation. This is the theory that students learn from networking textbooks. In practice, deploying a TCP extension is much more difficult: the maintainers of client stacks often wait until servers implement a given extension, and server maintainers wait for clients in the same manner. It often takes several years to deploy an option at a large scale.
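To make the notion of a TCP option concrete, here is a minimal sketch of how an option is laid out on the wire as a kind, a length and a value. It uses the RFC 5482 option mentioned above, whose kind is 28 and whose 4-byte encoding carries a 1-bit granularity flag followed by a 15-bit timeout. The function names are chosen for this example only.

```python
import struct

TCPOPT_USER_TIMEOUT = 28  # option kind assigned by RFC 5482


def encode_user_timeout(timeout: int, minutes: bool = False) -> bytes:
    """Encode a User Timeout option: kind, length, then G bit + 15-bit value."""
    if not 0 <= timeout < (1 << 15):
        raise ValueError("timeout must fit in 15 bits")
    # G = 1 means the value is in minutes, G = 0 means seconds.
    value = (int(minutes) << 15) | timeout
    return struct.pack("!BBH", TCPOPT_USER_TIMEOUT, 4, value)


def decode_user_timeout(option: bytes) -> tuple:
    """Return (timeout, minutes) from a 4-byte User Timeout option."""
    kind, length, value = struct.unpack("!BBH", option)
    if kind != TCPOPT_USER_TIMEOUT or length != 4:
        raise ValueError("not a User Timeout option")
    return value & 0x7FFF, bool(value >> 15)


opt = encode_user_timeout(30)
print(opt.hex())
```

Every TCP option other than the two single-byte ones (End of Option List and No-Operation) follows this kind/length/value layout, which is what allows a stack to skip options it does not understand.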

Thanks to eBPF, a different deployment model for TCP options is possible. This is illustrated in the figure below. If an application wants to use a specific TCP extension, it can inject the corresponding code inside the underlying TCP stack.

In "Beyond socket options: making the Linux TCP stack truly extensible", Viet-Hoang Tran demonstrates that this technique makes it possible to deploy new TCP extensions. As an illustration, here are two of the TCP extensions discussed in the paper.

The Linux TCP implementation supports a wide range of congestion control schemes as pluggable modules. Most servers select one of these congestion controllers and use it for all their connections. The first TCP extension described in this paper is a simple TCP option that a client can send to request that the server use a specific congestion control scheme. As the TCP option is exchanged during the three-way handshake, it affects the entire connection. The figure below shows the impact of a specific congestion control scheme on the round-trip time experienced by a long TCP connection.
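The server-side logic behind such an option can be sketched in a few lines. The example below is only a user-space model under stated assumptions: the wire format (experimental option kind 253 carrying the scheme name in ASCII), the allow-list and the default are all hypothetical and are not taken from the paper. It walks the option block of a SYN, looks for the request and falls back to the default when the option is absent or names a scheme the server does not allow.

```python
from typing import Optional

EXP_KIND = 253                      # experimental kind; layout is hypothetical
ALLOWED = ("cubic", "reno", "bbr")  # assumed server-side allow-list
DEFAULT = "cubic"                   # assumed server default


def encode_cc_option(name: str) -> bytes:
    """Build the hypothetical option: kind, length, ASCII scheme name."""
    payload = name.encode("ascii")
    return bytes([EXP_KIND, 2 + len(payload)]) + payload


def find_option(options: bytes, kind: int) -> Optional[bytes]:
    """Walk the kind/length/value list of a TCP header's option block."""
    i = 0
    while i < len(options):
        k = options[i]
        if k == 0:          # End of Option List
            break
        if k == 1:          # No-Operation, a single padding byte
            i += 1
            continue
        if i + 1 >= len(options):
            break           # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break           # malformed option
        if k == kind:
            return options[i + 2:i + length]
        i += length
    return None


def select_cc(syn_options: bytes) -> str:
    """Pick the congestion controller for a connection from its SYN options."""
    requested = find_option(syn_options, EXP_KIND)
    if requested is None:
        return DEFAULT
    name = requested.decode("ascii", errors="replace")
    return name if name in ALLOWED else DEFAULT


# MSS option (kind 2, value 1460), one NOP, then the client's request.
syn = b"\x02\x04\x05\xb4" + b"\x01" + encode_cc_option("bbr")
print(select_cc(syn))
```

On Linux, an application can already choose the congestion controller of its own sockets with the `TCP_CONGESTION` socket option. The point of the extension is that the request travels in the handshake, so the remote side can apply it too.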

Another example is a TCP option that is used by a client, e.g. a smartphone, during the three-way handshake to bound the value of the initial congestion window on the server. This feature could be used to let a smartphone advertise the congestion window that servers should use as a function of the network conditions (2G, 3G, 4G, …). Android smartphones already tune their TCP windows based on the current characteristics of the link layer, but this information is not communicated to the server. The figure below compares the page load time with different values of the option proposed by the client.

With the approach described in this paper, it becomes possible to innovate again in the Linux TCP stack. With such flexible TCP options, it could even be possible to deploy applications that perform A/B testing with the underlying TCP stack or adapt it to their needs.

eBPF enables deployable active networks

In the late nineties and early 2000s, active networks were a hot topic within the networking community. Researchers proposed to extend the packet forwarding model used on the Internet by adding a virtual machine on each router and placing machine code inside each packet. Instead of simply forwarding packets, the routers had to execute the code contained in each packet and process it accordingly. Several use cases, such as multicast or video transcoding, were identified and some prototypes were built. However, no real deployment took place and this research domain became less popular.

IPv6 Segment Routing opens a new opportunity for active networks. Segment Routing is a modern variant of source routing. Two dataplanes are being finalised within the IETF: MPLS and IPv6 Segment Routing. The IPv6 Segment Routing dataplane provides a lot of flexibility thanks to its variable-length header. It has been supported by the Linux kernel since version 4.10. One of the use cases for IPv6 Segment Routing is network programmability, i.e. the possibility of encoding inside each packet a set of actions that routers have to perform while forwarding the packet. Each router that needs to act on the packet is identified by an address inside the IPv6 Segment Routing header. Several types of per-packet processing are defined by the IETF and each router can associate a specific function with a given IPv6 address. In "Leveraging eBPF for programmable network functions with IPv6 Segment Routing", Mathieu Xhonneux and Fabien Duchene propose an efficient implementation of these functions in the Linux kernel. Their implementation exposes a new API that enables programmers to write eBPF code that is executed on a per-packet basis.
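To illustrate how the list of routers is carried in each packet, the sketch below builds a Segment Routing Header without optional TLVs, following the layout standardised by the IETF: an 8-byte fixed part followed by one 128-bit address per segment, stored in reverse order. The `build_srh` helper and the `fc00::` addresses are assumptions made for this example. The code only shows the header encoding, not the eBPF API proposed in the paper.

```python
import ipaddress
import struct

SRH_ROUTING_TYPE = 4  # Routing Type assigned to the Segment Routing Header


def build_srh(segments: list, next_header: int) -> bytes:
    """Build an SRH (no TLVs) for a path given in travel order."""
    if not segments:
        raise ValueError("at least one segment is required")
    n = len(segments)
    fixed = struct.pack(
        "!BBBBBBH",
        next_header,
        2 * n,             # Hdr Ext Len: 8-octet units after the first 8 octets
        SRH_ROUTING_TYPE,
        n - 1,             # Segments Left: index of the active segment
        n - 1,             # Last Entry: index of the last list element
        0,                 # Flags
        0,                 # Tag
    )
    # The list is stored in reverse: Segment List[0] is the final segment.
    seg_list = b"".join(
        ipaddress.IPv6Address(s).packed for s in reversed(segments)
    )
    return fixed + seg_list


# Three segments to visit in order; next header 6 means TCP follows.
srh = build_srh(["fc00::1", "fc00::2", "fc00::3"], next_header=6)
print(len(srh), srh[1], srh[3])
```

Each router on the path that owns the active segment decrements Segments Left and copies the next address into the IPv6 destination field. This per-segment processing step is where a router can run the function that it has associated with that address.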

The paper describes two use cases for this modern and deployable variant of active networks: a monitoring application that collects one-way delay measurements and a method to efficiently load balance traffic over different paths. A second paper, "Flexible failure detection and fast reroute using eBPF and SRv6", demonstrates that the same eBPF infrastructure makes it possible to implement a fast and efficient failure detection mechanism.

Bibliography

O. Tilmans, O. Bonaventure, COP2: Continuously Observing Protocol Performance, arXiv preprint, https://arxiv.org/abs/1902.04280, 2019

V.-H. Tran, O. Bonaventure, Beyond socket options: making the Linux TCP stack truly extensible, arXiv preprint, https://arxiv.org/abs/1901.01863, 2019

M. Xhonneux, F. Duchene, O. Bonaventure, Leveraging eBPF for programmable network functions with IPv6 Segment Routing, CoNEXT 2018, https://arxiv.org/abs/1810.10247

M. Xhonneux, O. Bonaventure, Flexible failure detection and fast reroute using eBPF and SRv6, 2018 14th International Conference on Network and Service Management (CNSM), https://arxiv.org/abs/1810.10260

Published on March 21, 2019