Simple networking for bare-metal Kubernetes

One of the most important and also least understood parts of Kubernetes is cluster networking. I guess you have already read through that and are now a bit confused about how to implement the required model, considering you were presented with about 30 different options. A bit more googling would narrow that down to a few popular choices. Nevertheless, container networking is an involved subject, and it’s easy to get lost in the details of any specific solution.

In one of our setups, what we needed was described in the docs like this:

L2 networks and linux bridging

If you have a “dumb” L2 network, such as a simple switch in a “bare-metal” environment, you should be able to do something similar to the above GCE setup. Note that these instructions have only been tried very casually - it seems to work, but has not been thoroughly tested. If you use this technique and perfect the process, please let us know.

Yes, we had a “dumb” L2 network (I mean, … who hasn’t), but that’s not the only reason we needed this approach. The applications running on the cluster needed to be accessed directly from outside the cluster. Most of the networking solutions mentioned in the documentation, on the other hand, focus on within-cluster communication. That’s natural, since Kubernetes is the go-to solution for microservices-oriented architectures, where small applications keep communicating with each other on the same cluster.

We shouldn’t interpret this as “internal and external networks must be isolated”, though. While it’s possible, and even encouraged by a lot of tools, to isolate them, it’s not strictly required, at least not in bare-metal environments. It’s perfectly OK to use externally routable IPs for the pods running in the cluster; it’s just not practical if the outside network happens to be the internet.

But how? The next line in the documentation says:

Follow the “With Linux Bridge devices” section of this very nice tutorial from Lars Kellogg-Stedman.

Yes, it’s a very nice one, though it leans more towards Docker-only environments. And the 2018 update about the macvlan driver certainly doesn’t help clear the confusion.

With CNI

So here is a simpler recipe:

  1. Note the subnet for your dumb L2 network (e.g. 10.10.0.0/16)

  2. IP addresses for your nodes will look like this:

    node1 10.10.1.0/16
    node2 10.10.2.0/16
    node3 10.10.3.0/16
    node4 10.10.4.0/16
    ...
    node100 10.10.100.0/16
    
  3. Enable IP forwarding on your nodes:

    sysctl -w net.ipv4.ip_forward=1
    
  4. Configure your kubelets to use CNI (you probably already have this, as everyone uses CNI nowadays)

    --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
    
  5. Configure the bridge CNI plugin (you probably already have the plugin installed too, as it’s one of the default ones; if not, extract a release to /opt/cni/bin on all nodes, as sketched right after this list.)

    cat <<EOF > /etc/cni/net.d/10-bridge.conf
    {
        "cniVersion": "0.3.1",
        "name": "data",
        "type": "bridge",
        "bridge": "cni0",
        "promiscMode": true,
        "ipam": {
            "type": "host-local",
            "ranges": [
                [
                    {
                        "subnet": "10.10.0.0/16",
                        "rangeStart": "10.10.1.1",
                        "rangeEnd": "10.10.1.101",
                        "gateway": "10.10.1.0"
                    }
                ]
            ],
            "routes": [
                    { "dst": "0.0.0.0/0" }
            ]
        }
    }
    EOF
    
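If you do need to install the plugins, a minimal sketch looks like the following; the version number and URL below are only an example, so check the containernetworking/plugins releases page for a current one:

    # Example only: pick a current release from
    # https://github.com/containernetworking/plugins/releases
    CNI_VERSION=v0.8.7
    curl -L -o cni-plugins.tgz \
      "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz"
    sudo mkdir -p /opt/cni/bin
    sudo tar -xzf cni-plugins.tgz -C /opt/cni/bin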

This instructs the kubelet to use CNI for pod networking, which in turn causes our CNI-compatible container runtime to invoke the configured plugin(s). This specific bridge plugin creates a veth pair for each of our pods and connects them all to the same bridge. It also configures the bridge we specify here, for example by setting the necessary promisc attribute.
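To sanity-check this, once a pod or two are running on a node you can inspect the bridge from the host with plain iproute2 commands, for example:

    # veth interfaces of the pods attached to the cni0 bridge
    ip link show master cni0
    # bridge details; PROMISC should appear among the flags
    ip -d link show cni0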

IPAM (IP address management) is configured as host-local, meaning the IP addresses for pods running on this host are managed by the host itself, without any centralized component like DHCP. This example assigns pod IPs from the 10.10.1.1 - 10.10.1.101 range, which gives you roughly 100 pods on this node. Note that this range and the gateway need to be changed for each node, e.g. 10.10.2.1 - 10.10.2.101 and 10.10.2.0 for node2, and so on.
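For example, on node2 only the ipam part of /etc/cni/net.d/10-bridge.conf would differ, along these lines:

    "ipam": {
        "type": "host-local",
        "ranges": [
            [
                {
                    "subnet": "10.10.0.0/16",
                    "rangeStart": "10.10.2.1",
                    "rangeEnd": "10.10.2.101",
                    "gateway": "10.10.2.0"
                }
            ]
        ],
        "routes": [
                { "dst": "0.0.0.0/0" }
        ]
    }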

Final touches

The last step is connecting to the bare-metal network, simply by adding the actual network interface to this bridge. If it’s a single Ethernet interface like eth1:

sudo ip link set dev eth1 master cni0

If it’s a LACP bond, add the bond interface instead. You also need to assign your node’s IP address to the bridge itself, not to the physical network interface:

sudo ip a add 10.10.1.0/16 dev cni0
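For the bond case, attaching works the same way; here assuming the bond interface is called bond0 (the name is just an illustration):

    sudo ip link set dev bond0 master cni0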

If you have any external networks (like 10.11.0.0/16, 10.12.0.0/16, …), there should be a gateway on your subnet to reach them (like 10.10.0.1); set the corresponding routes on your nodes as well. And of course, use whatever your Linux distribution provides to keep your network configuration persistent across reboots (NetworkManager, ifupdown, systemd-networkd, netplan, etc.).
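For instance, a route towards one of those external networks through that gateway would look like this (using the example addresses above; persistence is again up to your distro’s tooling):

    sudo ip route add 10.11.0.0/16 via 10.10.0.1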

Happy hunting!
