In Calico v3.9, the Calico team introduced the ability to live-migrate clusters from flannel networking to Calico without application downtime. In this blog post, I’d like to discuss whether live migration is right for you, and how it works.
First, some background
In a Kubernetes cluster, flannel provides a number of different networking backends for getting traffic from one node to another. The most common backend uses VXLAN encapsulation for pod traffic between different cluster nodes. Flannel itself offers basic networking features but lacks a number of capabilities provided by Calico – like flexible IP address management and network policy.
Calico also provides a VXLAN data plane for pods, so you can keep using VXLAN networking and gain the extra capabilities that Calico offers, all in one package.
Should I migrate?
There are a few different reasons why you might want to consider migrating your clusters from flannel to Calico. Let’s walk through a couple of them.
You want network policy
Network policy is a key part of building security into your Kubernetes deployments, and something that flannel does not provide natively (Calico does).
You could run Calico in policy-only mode on top of flannel networking (also known as “canal”), but there are a few downsides to this.
Firstly, this is a choice you need to make at cluster creation time, which means if you didn’t create your cluster in this way to begin with, you will need to recreate the entire cluster – this time with Calico installed as well.
Secondly, running both flannel and Calico introduces an extra moving part. You likely want to simplify your configuration by running only Calico.
Finally, while canal provides some of Calico’s features on top of flannel, the two projects make fundamentally different assumptions about cluster networking, so canal can’t enable the full Calico feature set. For example, it’s missing Calico’s IP address management features discussed below.
You want more flexible IP address management
Flannel’s networking implementation is strongly rooted in the use of the host-local IPAM (IP address management) CNI plugin. This is a simple approach to managing how IP addresses are allocated in your cluster, but it comes with a few limitations:
- Each node is pre-allocated a CIDR at creation time. If the number of pods you ultimately run per node exceeds the number of addresses available per node, you will need to recreate the cluster. If the number of pods is much smaller than the number of addresses available per node, you will use your IP address space inefficiently, which can lead to IP address exhaustion when operating at scale.
- Since each node has a pre-allocated CIDR, each pod must always be assigned an address based on the node it is running on. We know in the real world there are use-cases that demand allocation of IP addresses based on other attributes, for example the pod’s namespace.
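To make the first limitation concrete, here is a rough sketch of the arithmetic behind fixed per-node blocks. The cluster CIDR, the per-node prefix length, and the pod count below are illustrative assumptions, not values taken from any particular cluster, and the numbers are raw address counts that ignore any addresses reserved within a block.

```python
import ipaddress


def per_node_capacity(cluster_cidr, node_prefix_len):
    """Return (max_nodes, addresses_per_node) when every node gets one fixed block."""
    network = ipaddress.ip_network(cluster_cidr)
    if node_prefix_len < network.prefixlen:
        raise ValueError("per-node block cannot be larger than the cluster CIDR")
    # Number of fixed-size blocks that fit in the cluster CIDR.
    max_nodes = 2 ** (node_prefix_len - network.prefixlen)
    # Raw number of addresses inside each per-node block.
    addresses_per_node = 2 ** (network.max_prefixlen - node_prefix_len)
    return max_nodes, addresses_per_node


# Assumed example values: a /16 pod network split into one /24 per node.
max_nodes, per_node = per_node_capacity("10.244.0.0/16", 24)
pods_per_node = 30  # hypothetical workload density

print(f"nodes supported: {max_nodes}")
print(f"addresses per node: {per_node}")
print(f"share of each block in use: {pods_per_node / per_node:.1%}")
```

With these assumed numbers, most of each node's block sits idle, and those idle addresses cannot be lent to a busier node.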
Migrating to Calico enables you to leverage Calico’s flexible IP address management, which solves these use-cases and more. For more information, see how to get started with IPAM.
How do I switch?
There are two ways to switch your cluster to use Calico:
- Create a new cluster using Calico and migrate existing workloads.
- Perform a live migration on an existing cluster.
If you don’t care about downtime, or if you have the ability to migrate workloads from one cluster to another without downtime, then we recommend simply creating a new cluster using Calico and migrating your workloads. This is the easiest way to get started using Calico.
However, if you cannot move your production workloads to a new cluster, you can perform a live migration from flannel to Calico. For most users, this is as simple as applying a new Kubernetes manifest.
How live migration works
The live migration is simple to use, but behind the scenes there are a number of things going on, all orchestrated by a purpose-built migration controller. Let’s talk about what’s going on in a bit more detail.
The migration controller has three main stages:
Pre-flight checks to make sure migration is possible
The first thing the controller does is check the cluster configuration to make sure a migration is possible. It verifies that flannel is configured properly and in a way that is compatible with migration to Calico. For example, it asserts that the flannel VXLAN backend is in use.
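As an illustration of what such a check might look like, here is a minimal sketch that inspects a flannel network configuration in the JSON shape used by flannel's `net-conf.json`. This is not the migration controller's actual code; the function name `check_flannel_backend` and the error message are hypothetical.

```python
import json


def check_flannel_backend(net_conf_json):
    """Return the backend type if it is VXLAN, otherwise raise ValueError.

    Hypothetical helper: a stand-in for one pre-flight check, not the
    real controller logic.
    """
    conf = json.loads(net_conf_json)
    backend = (conf.get("Backend") or {}).get("Type", "")
    if backend.lower() != "vxlan":
        raise ValueError(
            f"unsupported flannel backend {backend!r}: live migration expects vxlan"
        )
    return backend.lower()


# Example configuration with assumed values.
example = '{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan"}}'
print(check_flannel_backend(example))
```

Failing early like this means an incompatible cluster is rejected before any node is touched.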
Perform a rolling update of each node
This is where most of the action occurs. Once cleared for takeoff, the controller uses node labels to perform a controlled rolling update of each node in the cluster. For each node in the cluster, the migration controller will do the following:
- Drain pods from the node and prevent scheduling of new pods.
- Remove flannel and its configuration from the node.
- Install Calico on the node.
- Re-enable pod scheduling to the node.
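The per-node steps above can be sketched as a toy in-memory model. This only illustrates the control flow and does not talk to a real cluster; the `Node` class, the `rolling_migrate` function, and the label key `example.com/network` are all hypothetical names, not the ones the migration controller uses.

```python
from dataclasses import dataclass, field

NETWORK_LABEL = "example.com/network"  # hypothetical label key


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=lambda: {NETWORK_LABEL: "flannel"})
    schedulable: bool = True
    pods: list = field(default_factory=list)


def rolling_migrate(nodes):
    """Migrate nodes one at a time and return the names that were migrated."""
    migrated = []
    for node in nodes:
        if node.labels.get(NETWORK_LABEL) == "calico":
            continue  # already migrated, so re-running the loop is harmless
        node.schedulable = False               # prevent scheduling of new pods
        node.pods.clear()                      # drain: pods get rescheduled elsewhere
        node.labels[NETWORK_LABEL] = "calico"  # stands in for remove flannel + install Calico
        node.schedulable = True                # re-enable pod scheduling
        migrated.append(node.name)
    return migrated


cluster = [Node("node-a", pods=["web-1"]), Node("node-b", pods=["db-1"])]
print(rolling_migrate(cluster))
```

Because only one node is unschedulable at any moment in this model, the rest of the cluster keeps serving workloads while each node is switched over.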
The migration controller makes sure to configure Calico so that nodes running Calico and nodes still running flannel will continue to work together, leaving the cluster fully operational throughout the migration.
Once complete, your cluster will be using solely Calico for networking. Calico’s IPAM will now manage pod IP allocations, and the full set of Calico features will be available to use.
Clean up old resources
At this stage, all the nodes are successfully running Calico, and flannel is no longer needed. The controller removes the flannel DaemonSet (which no longer controls any pods) and exits.