
exotic networking patterns for load balancing that you probably don't need

After sacrificing a few hours in the quest for the optimal bare-metal k8s setup for a project I am working on (more on that in a different blog post), I found myself falling down the rabbit hole of some exotic networking patterns used in modern Kubernetes load balancing. According to a post from Dynatrace, “a typical cluster running in the public cloud consists of 5 relatively small nodes with just 16 to 32 GB of memory each. In comparison, on-premises clusters have more and larger nodes: on average, 9 nodes with 32 to 64 GB of memory.” So, when I remembered that kube-proxy added support for IPVS starting with version 1.8 (GA in 1.11), my secondary reaction was doubt. My initial one was probably indifference, because when k8s 1.8 was current, I didn’t know much about Kubernetes. Or networks. Or computers, to be honest.

Despite my late blooming into the computing industry, I now know that there are probably only a handful of organizations in the world that would benefit from the lower computational complexity. Where I could see a benefit for more people is that IPVS was designed from the ground up as a load balancer, so it offers a broader selection of scheduling algorithms. The randomized, equal-cost selection iptables performs when redirecting traffic is, indeed, suboptimal.

A bit of iptables and IPVS history

Netfilter is the Linux kernel’s original packet processing system. The commonplace name iptables comes from the command used to interact with it, and because of their extremely tight coupling, and iptables being one of the most used commands in network computing, I’ll refer to both as iptables, even when talking about the Netfilter framework. The architecture of iptables groups network packet processing rules into tables by function (e.g. packet filtering, NAT, and other packet mangling), each of which holds chains of processing rules made up of matches and targets.

[Diagram: the iptables hierarchy — tables → rule chains → matches.]

Examining it in reverse: matches determine which packets a rule applies to, and targets determine what is done with the matching packets.
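As a minimal sketch of how that hierarchy composes, here is a toy model in Python. The types and names are purely illustrative (this is not the Netfilter API): each rule pairs a match predicate with a target, and a chain is evaluated sequentially until a rule matches.

```python
# Toy model of an iptables chain: rules pair a match with a target,
# and the chain is traversed sequentially until the first match.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Packet:
    proto: str
    dport: int

@dataclass
class Rule:
    match: Callable[[Packet], bool]   # which packets this rule applies to
    target: str                       # what to do with matching packets

def traverse_chain(rules: list[Rule], pkt: Packet, policy: str = "ACCEPT") -> str:
    # Sequential evaluation: the first matching rule's target wins.
    for rule in rules:
        if rule.match(pkt):
            return rule.target
    return policy  # no rule matched: fall back to the chain's default policy

input_chain = [
    Rule(lambda p: p.proto == "tcp" and p.dport == 22, "ACCEPT"),  # SSH
    Rule(lambda p: p.proto == "tcp" and p.dport == 80, "ACCEPT"),  # HTTP
]

print(traverse_chain(input_chain, Packet("tcp", 80)))  # ACCEPT
```

The sequential walk in `traverse_chain` is the detail that matters later: its cost grows with the number of rules in the chain.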

[Diagram: INPUT chain traversal — a packet destined for dport 80 is checked against “-p tcp --dport 22” (SSH), then matches “-p tcp --dport 80” (HTTP); the chain’s default policy is ACCEPT.]

The way kube-proxy leverages iptables is by attaching rules to the NAT PREROUTING chain to implement its load balancing. This is quite simple, uses what is now a very mature kernel feature, and works in tandem with the vast majority of networking infrastructure software that also relies on iptables for filtering (think CNIs, firewalls, etc.).

However, the way kube-proxy is forced to program the NAT PREROUTING chain is suboptimal. Nominally, it is an O(n) operation to traverse the chain, where n is the number of rules in the chain. What’s more, the chain grows linearly with the number of services (and subsequently the number of pods) in the cluster, leading to a linear increase in traversal time.


IPVS, by contrast, managed to evade having its name overwritten by its user-space utility, ipvsadm. It was merged relatively late into the Linux kernel, entering mainline in the 2.4.x series. Designed specifically for load balancing, IPVS maintains a conntrack-like table of virtual connections and uses standard load balancing scheduling algorithms, offering three forwarding modes: NAT, direct routing, and tunneling.

When a packet arrives at the interface where a virtual service (VIP:port) is configured, IPVS checks whether it belongs to an existing connection; if so, it is forwarded to the stored backend. Otherwise, IPVS picks a backend based on the scheduling algorithm and creates a new entry in the connection table so future packets stay pinned.
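That two-step behavior (schedule once, then stay pinned) can be sketched in a few lines of Python. This is illustrative pseudocode of the idea, not kernel code: a round-robin scheduler plus a connection table keyed by client flow.

```python
# Sketch of IPVS-style virtual-connection handling (illustrative, not kernel code):
# the first packet of a flow picks a backend via the scheduler; later packets
# of the same flow hit the connection table and stay pinned to that backend.
from itertools import cycle

class VirtualService:
    def __init__(self, backends):
        self._rr = cycle(backends)      # round-robin scheduler
        self._conntrack = {}            # (client_ip, client_port) -> backend

    def forward(self, client_ip, client_port):
        key = (client_ip, client_port)
        if key not in self._conntrack:  # new flow: schedule a backend
            self._conntrack[key] = next(self._rr)
        return self._conntrack[key]     # existing flow: stays pinned

vip = VirtualService(["192.168.1.11", "192.168.1.12"])
print(vip.forward("10.0.0.5", 40001))  # 192.168.1.11
print(vip.forward("10.0.0.6", 40002))  # 192.168.1.12
print(vip.forward("10.0.0.5", 40001))  # 192.168.1.11 (pinned)
```

Swapping the scheduler (least-connections, weighted variants, and so on) only changes how the first packet of a flow is assigned; the pinning logic stays the same.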

[Diagram: IPVS round-robin load balancing — the LVS director at VIP 192.168.1.10:80, using the Round Robin (RR) scheduler, alternates client requests between Real Server 1 (192.168.1.11) and Real Server 2 (192.168.1.12).]

Rather than a list of sequential rules, it offers an optimized API and an optimized lookup routine. The result in kube-proxy is a nominal computational complexity of O(1): in most scenarios, connection processing performance stays constant regardless of the number of services in the cluster. One potential downside of operating IPVS within a Kubernetes cluster is that you have to investigate whether it will behave as expected alongside tools that rely on iptables packet filtering.

In the wild

So, nominally kube-proxy’s connection processing performance is better in IPVS mode than in iptables mode. In practice, there are two key attributes you will likely care about when it comes to the performance of kube-proxy:

  • CPU usage: how does your host where your pods are scheduled perform, including userspace and kernel/system usage, across all the processes needed to support your microservices stack, including kube-proxy?
  • Round-trip time: when your microservices call each other, how long does it take on average for them to send and receive requests and responses?

The best way to test this is to launch a load generator client microservice pod on a dedicated node, generating around 1000 requests per second to a Kubernetes service backend. Scaling up to 100,000 service backends and running the load tests on repeat, I was able to paint a picture of the performance of kube-proxy in both IPVS and iptables mode.

[Chart: Kubernetes service routing performance at scale (10 endpoints per service, 1000 rps per client node) — iptables mode: sequential rule chain evaluation, cost grows with every additional service; IPVS mode: constant-time lookup through hashed connection tables.]

When considering round-trip response time, it’s important to note that the difference between connections and requests is persistence. Most of the time, microservices will use “keepalive” connections, where each connection is reused for multiple requests. This matters because every new connection requires a three-way handshake (SYN, SYN-ACK, ACK), which in turn requires more processing within the kernel networking stack.

Nginx is everyone’s go-to for simulating networking applications, so we used it and its default keepalive configuration to get a ratio of round-trip response time vs number of connections. The default keepalive limit is 100 requests per connection, so we used that as our upper bound.

[Animation: connection overhead, keepalive vs new — a new connection must complete SYN, SYN-ACK, ACK between client and server before REQ 1/RESP 1; a keepalive connection reuses the same channel for REQ 2/RESP 2 onward.]

As you can see, negotiating a new connection every time severely penalizes your response times. And the more services you scale up, the more connections are being dropped and re-established.

Why Keepalive Matters (skip if you are overly familiar with TCP)

To understand this overhead, we have to distinguish between a request and a connection. When a client (like another microservice) wants to talk to a backend, it first has to establish a TCP connection via a three-way handshake:

  1. SYN: Client asks to sync.
  2. SYN-ACK: Server acknowledges and asks to sync back.
  3. ACK: Client acknowledges the server.

Only after this dance can the actual HTTP data be transmitted. If we close the connection after every single request, we pay this 3-way latency tax over and over again. By utilizing HTTP Keep-Alive (or persistent connections), a single TCP connection is kept open and reused for multiple subsequent requests. In the context of kube-proxy, iptables needs to evaluate rules for new connections. Once a connection is established, Netfilter’s conntrack (connection tracking) module remembers it, and subsequent packets for that connection bypass the heavy rule traversal. This means if your microservices are properly pooling and keeping connections alive, the theoretical O(n) penalty of iptables is paid far less frequently than you might think.
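The conntrack fast path described above can be made concrete with a small simulation. This is a back-of-the-envelope sketch, not a measurement: the rule count and the flow keys are made up, and the point is only to show how keepalive changes how often the O(n) traversal tax is paid.

```python
# Sketch of why keepalive blunts the O(n) cost: only the first packet of a
# connection traverses the rule chain; conntrack remembers established flows.
RULES = list(range(10_000))   # stand-in for 10k programmed iptables rules

conntrack = set()
rule_evaluations = 0

def handle_packet(flow, matching_rule=9_999):
    global rule_evaluations
    if flow in conntrack:          # established flow: fast path, no traversal
        return
    for r in RULES:                # new connection: sequential O(n) traversal
        rule_evaluations += 1
        if r == matching_rule:
            break
    conntrack.add(flow)

# One keepalive connection carrying 100 requests...
for _ in range(100):
    handle_packet(("10.0.0.5", 40001))
print(rule_evaluations)            # 10000: the traversal tax is paid once

# ...versus 100 fresh connections, one per request.
rule_evaluations = 0
for port in range(100):
    handle_packet(("10.0.0.5", 50000 + port))
print(rule_evaluations)            # 1000000: paid on every single request
```

Same 100 requests, two orders of magnitude difference in rule evaluations, which is why connection pooling matters more than the proxy mode for most clusters.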

iptables bottleneck math

We have a mechanism in IPVS that clearly boasts superior lookup mathematics (O(1) vs O(n)). If this were purely a problem of mathematics (and in computing, surprisingly, it often isn’t), we’d be looking at a cluster with 1,000 to 5,000 services before we’d see a real issue, and the choice of a network layer would be as simple as comparing two numbers.

In computer science, O(n) means the time taken grows linearly with the number of items. For iptables, n is the number of services (and their associated endpoints). If you have 10 services, traversing the rule chain is instantaneous. If you have 10,000 services, kube-proxy has programmed tens of thousands of iptables rules. When a new packet arrives, the Linux kernel must potentially evaluate it against all of those rules sequentially until it finds a match.

O(1), on the other hand, means constant time. No matter if you have 10 services or 100,000, IPVS uses a hash table to find the correct backend route instantly in a single mathematical step.
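The two lookup shapes are easy to contrast side by side. The sketch below is illustrative (a Python list and dict standing in for the kernel’s actual data structures): both lookups find the same backend, but one takes as many steps as there are entries before the match, while the other always takes one.

```python
# Contrasting the two lookup shapes (illustrative stand-ins, not kernel code):
# iptables walks a rule list sequentially; IPVS hashes straight to the entry.
services = [(f"10.96.0.{i}", f"backend-{i}") for i in range(1, 200)]

def iptables_lookup(dst):
    # O(n): scan rules in order until the destination matches.
    steps = 0
    for vip, backend in services:
        steps += 1
        if vip == dst:
            return backend, steps
    return None, steps

ipvs_table = dict(services)

def ipvs_lookup(dst):
    # O(1): a hash-table lookup, one step regardless of table size.
    return ipvs_table.get(dst), 1

print(iptables_lookup("10.96.0.150"))  # ('backend-150', 150)
print(ipvs_lookup("10.96.0.150"))      # ('backend-150', 1)
```

Doubling the number of services doubles the worst-case step count on the left and changes nothing on the right.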

If you remember from earlier, a statistically typical cluster holds only a handful of nodes and realistically less than 1,000 services. The fact of the matter is that iptables processing only truly starts failing the latency sniff test once you surpass the 1,000 to 5,000 service mark. Below those numbers, the CPU processing difference between O(n) and O(1) in the kernel is practically unnoticeable in the face of network I/O overhead. We are debating over single-digit millisecond or even microsecond latency differences.

What if you didn’t have to choose?

Even when you are facing those massive cluster topologies where the O(n) math becomes a real bottleneck, replacing kube-proxy’s backend with IPVS isn’t the magic wand infrastructure people want it to be. That’s not to say there isn’t one.

eBPF is an interesting kernel runtime that allows you to run sandboxed programs directly within the Linux kernel space without having to change kernel source code or load kernel modules. Instead of relying on the Netfilter stack to process packets at the network layer, eBPF allows tools like Cilium to attach routing logic directly to the socket layer or early in the network interface controller (NIC) packet pipeline.

To put it simply, networking, and specifically in Kubernetes, can be thought of as a mail sorting system inside a city. When traffic comes into a cluster, something has to decide which pod should receive it. Traditionally, this job is done by the Netfilter stack via kube-proxy. With eBPF, instead of relying on an external application, a program is attached to the kernel that can make routing decisions directly on the packet level.

By the time a packet even reaches the Netfilter stack where iptables and IPVS live, eBPF has already routed it directly to the correct pod. It radically bypasses the entire traditional networking stack, blowing the performance and observability of both IPVS and iptables completely out of the water.

[Diagram: packet routing race — iptables: NIC → Netfilter stack → O(n) rule traversal → pod; IPVS: NIC → Netfilter stack → O(1) hash lookup → pod; eBPF: NIC → eBPF hook → pod.]
The 1% problem

Essentially, the deprecation of iptables in favor of more advanced implementations like IPVS or eBPF proxies feels like it was designed to cater to the top 1% of Kubernetes users—the hyperscalers, the massive multi-tenant SaaS providers, and the heavily fragmented microservice jungles. For the vast, overwhelming majority of teams running a dozen or so services on a 5-node cluster, migrating the networking layer brings no tangible real-world benefit and simply adds operational complexity to a stack that is already a jigsaw puzzle assembled blindfolded.

But, if your workload is small enough that the algorithmic complexity of sequential IP filtering is completely irrelevant to you… should you even be running Kubernetes at all?

But I think we’ll leave that philosophical crisis for another day.