It’s obvious, Project Calico needs to communicate better

Greg Ferro at Ethereal Mind just published a briefing on Project Calico. As always, it’s very interesting to see any write-up of your work, especially because it necessarily reflects how well you communicate your ideas!

In this case I think there’s some work to do, because there are a few areas of confusion in the above write-up that should be cleared up.

Since Greg made so many points and assertions, it might be easiest if we address each of the areas he touches on.  Let’s get started.

Encapsulation

This is the big one, and it is clearly Greg’s opinion, rather than a technical concern, so let’s address it first. Greg says a few things, but the main part of this paragraph is this:

I reject totally the assertion that overlay networking isn’t the best solution.

That’s a strong position, but I have to ask: the best solution to what?

Let’s be clear: if your requirement is to be able to pretend to a user that they have a virtual physical network running between their virtual machines, then overlays are unbeatable. Nothing else comes close. The Calico team has never contended that overlays aren’t the best fit to that use case.

However, if your users only need IP traffic, or if you value simplicity and scale, the case for overlays becomes a lot weaker. Overlays impose costs all across your data center: they impose per-packet handling costs, they limit the ability to efficiently use your underlay network, they require special-casing in network hardware, they make debugging hard, and they have poor performance at scale.

The big thing to ask yourself is: what do your users need?  Do your users use IPX or EtherTalk, or do they just use IP? If it’s the latter, why encapsulate it?  Calico has solutions for things like overlapping address space, but we don’t see that in most at-scale environments today, so why build the extra complexity into the base architecture when it just increases the complexity of OA&M without bringing value?  If you do need it, Calico can do it, and ONLY the traffic that requires that treatment will be burdened with the extra complexity, not every connection (as in the overlay case).  Very often, network equipment vendors state that “an overlay is the answer.”  Just as often, however, they don’t know what question they are answering.

To paraphrase Dr Malcolm in Jurassic Park: the network virtualization world has been so preoccupied with whether or not we could virtualize layer 2 networks and bring them into the cloud, we never stopped to consider whether we should.

Now that the philosophical disagreement has been addressed, let’s address some of the technical misconceptions.

Open vSwitch

Greg says:

[Calico is] an open-source project sponsored/owned by Metaswitch that promotes a model of programming Open vSwitch on Linux using BGP as an API.

This isn’t really true. Firstly, Calico does not program or use Open vSwitch in any way.  In fact, you can remove Open vSwitch from a Calico network, as its functions are completely bypassed. Instead, Calico programs the native routing function of the Linux kernel. Secondly, we do not use BGP as an API; we use it as it was intended, as a routing protocol that tells other Calico nodes (and the rest of the infrastructure) where workloads are at any given point in time. Our API is in the form of an etcd data model. Our use of BGP should be totally opaque to anything trying to program Calico.

Proxy ARP

Next up, Greg says:

ARP hijacking does present operational risk to certain types traffic loads in the enterprise based on my experience.

Assuming that when Greg says “ARP hijacking” he means proxy ARP, there is no question that proxy ARP can present problems when used recklessly. We’ve covered some of this in our FAQ, but just to clarify: we don’t “proxy ARP” (we need to change the wording in our text). Instead, we send fixed ARP replies to an OpenStack-managed VM using IPv4 for connectivity, to ensure that all of its traffic is correctly captured by the compute server’s IP forwarding stack and not dropped on the floor.  We do NOT need to do this for any containerized IPv4 speakers, or any IPv6 speakers (VMs or containers), as those environments have other ways of ensuring the same behavior.

The only risk this ARP handling then poses is that it only works for IP traffic. Given that Calico is only intended to work for IP traffic, that’s a totally reasonable limitation that does not affect the function of Calico.

Licensing

Another point that Greg makes is:

Project is open source with an Apache license but Metaswitch is the project owner and controls contributions and some rather odd patent assertions that are onerous and need further investigation.

There’s a lot here, so let’s unpack some of it.

Firstly, yes, Calico is licensed under the Apache 2.0 license. This is a perfectly standard license; there’s nothing special here.

When Greg talks about ‘rather odd patent assertions’ I presume he means the Contributor License Agreement.  The use of CLAs of this form is extremely common in open source projects. Ours is heavily inspired by the one used by the OpenStack Foundation, but both the Python Software Foundation (for CPython) and the Free Software Foundation (for GNU projects) have similar CLAs that have similar clauses.  That said, if members of the community think that the CLA is suboptimal, we welcome suggestions for improvement (just as we do for any part of the project, including code, documentation, etc.), so long as it still provides the protections for the contributors and users that are necessary for a project that is actually going to be used.

The Apache 2.0 license is different to other permissive open source licenses in part because it includes a so-called ‘patent clause’. This clause ensures that users can safely use Apache 2.0-licensed code without worrying that a patent lawsuit may be brought against them by the entity that contributed the code. If, for example, Metaswitch possessed patents that covered some of the Calico functionality, the Apache 2.0 license means that Metaswitch cannot sue you for patent infringement for using the Calico code. We consider this almost mandatory for any deployable open source project.

The Contributor License Agreement (CLA) is an extension to the standard contribution process, and it serves to provide an explicit legal agreement between Project Calico and its contributors. The agreement is that anyone contributing code to Project Calico agrees to assign the copyright on that code to Project Calico. Additionally, they grant Project Calico a license to any patent affecting the code they contribute.

The reason for these clauses is to ensure that the code in Project Calico is protected. When you download it and use it, you can be certain that you will not suffer a patent lawsuit from us or from any other contributor.

Security

Greg says:

The use of next hop addressing means that spoof attacks could be a practical attack vector. Calico does configure iptables on hosts but this doesn’t protect against spoofing.

The use of ACLs (in Linux they are called iptables rules) to enforce policy has always been vulnerable to IP address spoofing.  Quite some time ago (say about 20 years), the industry invented a technique called RPF (reverse path forwarding) as a way to remove that vector.  Since Calico does not want to reinvent the wheel, but rather to leverage tools that have been in use for a long time and are well understood, we have always implemented an RPF check in Calico’s base security model to address the vulnerability.

For those of you who are not familiar with RPF, it drops all traffic sourced from an address on an interface where there isn’t a route to that address that points down that interface.  An advantage of Calico is that the router that enforces this RPF check is directly adjacent to the node being filtered.  There is no way for the spoof to get through.
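
As a rough illustration of the check described above, here is a minimal Python sketch of a strict RPF decision. The routing table, the interface names (tap-a, tap-b, eth0) and the function name are all hypothetical; this models the logic only and is not Calico’s or the kernel’s implementation.

```python
import ipaddress


def rpf_accepts(routes, src_ip, in_iface):
    """Strict reverse-path check.

    Accept a packet only if the longest-prefix route back to its source
    address points out of the interface the packet arrived on.
    """
    src = ipaddress.ip_address(src_ip)
    best = None  # (network, interface) of the most specific matching route
    for prefix, iface in routes:
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best is not None and best[1] == in_iface


# Hypothetical routing table on one compute host: one /32 per workload,
# plus a default route towards the rest of the network.
ROUTES = [
    ("10.0.0.10/32", "tap-a"),
    ("10.0.0.11/32", "tap-b"),
    ("0.0.0.0/0", "eth0"),
]

if __name__ == "__main__":
    print(rpf_accepts(ROUTES, "10.0.0.10", "tap-a"))  # workload using its own address
    print(rpf_accepts(ROUTES, "10.0.0.11", "tap-a"))  # workload spoofing a neighbour
    print(rpf_accepts(ROUTES, "8.8.8.8", "tap-a"))    # workload spoofing an outside address
```

Because the host holds a specific route for each local workload, a spoofed source either matches another workload’s interface or falls through to the default route, and in both cases the interface does not match.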

It is actually MORE secure to enforce a policy on the actual adjacency to a node that is referred to by that policy, rather than doing so at some removed point in the network.

However, defence in depth is important, and with settings like these it is always possible that the setting will be disabled or that the Linux kernel will have a bug, so we double up and also insert iptables rules to deal with this. Specifically, we add something like the following to the iptables chain that handles packets emitted by workloads (abridged for clarity):

Chain felix-from-a64e2126-8a (1 references)
pkts bytes target  prot opt in     out     source               destination
0     0 RETURN  all  --  any    any     10.0.0.10            anywhere             MAC FA:16:3E:2B:1E:5F
0     0 DROP    all  --  any    any     anywhere             anywhere             /* Default DROP if no match (endpoint a64e2126-8a88-4880-b482-4067b0593818): */

This set of rules is implemented through a slightly more complex set of checks that allow dynamically adding extra IPs and MAC addresses, but the net effect is the same. Essentially, iptables will only allow traffic through if it is from an explicitly whitelisted source IP address. Otherwise, we will drop the traffic.
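
The net effect of that chain can be sketched in a few lines of Python. This is only a model of the decision, under the assumption that the whitelist is a set of (source IP, source MAC) pairs as in the abridged listing; the names evaluate_chain and ALLOWED are made up for illustration and do not come from the Calico code.

```python
def evaluate_chain(allowed, src_ip, src_mac):
    """Model of the per-endpoint chain.

    Returns "RETURN" (let the packet continue) when the source IP and MAC
    pair is explicitly whitelisted, and "DROP" otherwise.
    """
    key = (src_ip, src_mac.upper())  # compare MAC addresses case-insensitively
    return "RETURN" if key in allowed else "DROP"


# Whitelist matching the single RETURN rule in the listing above.
ALLOWED = {("10.0.0.10", "FA:16:3E:2B:1E:5F")}

if __name__ == "__main__":
    print(evaluate_chain(ALLOWED, "10.0.0.10", "fa:16:3e:2b:1e:5f"))  # expected pair
    print(evaluate_chain(ALLOWED, "10.0.0.99", "FA:16:3E:2B:1E:5F"))  # spoofed IP
    print(evaluate_chain(ALLOWED, "10.0.0.10", "FA:16:3E:00:00:01"))  # spoofed MAC
```

Adding an extra IP or MAC for an endpoint then amounts to adding one more pair to the set, which is the dynamic behavior the fuller rule set provides.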

This represents comprehensive defense against IP spoofing attacks.

Server Resource

On the topic of server resource, Greg says:

This might be solved but the resource costs of holding a large BGP table and metadata in the server OS needs research.

The same could be said about any server-based networking model.  We actually believe that we maintain less state (and therefore use less memory and CPU) than models where every node in the network is a tunnel destination rather than just an endpoint.  The alternative is to push all of that state into dedicated hardware (the networking switches).  The cost of “at packet speed” memory on those nodes is substantially higher than the cost of memory on the server.  Anyone who has done infrastructure at scale should understand that state should be minimized, and what state there is should be horizontally distributed.  That has been, and continues to be, one of Calico’s design goals.

Profiling how hard it is to keep a whole routing table in the OS is tricky, because measuring kernel memory usage isn’t something we do a lot of, but we have seen what the effect is on BIRD. In our tests, inserting 500,000 routes into BIRD causes BIRD to consume about 160 MB of memory. Given that most compute servers today are going to ship with at least 64GB of memory (and probably much more), this seems acceptable to us (and everyone we’ve talked to who is actually deploying at scale).
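
To put those figures in proportion, here is the back-of-the-envelope arithmetic as a short Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the text; decimal units are an assumption.
ROUTES = 500_000
BIRD_MEMORY_BYTES = 160 * 10**6    # about 160 MB used by BIRD
SERVER_MEMORY_BYTES = 64 * 10**9   # a 64 GB compute server

bytes_per_route = BIRD_MEMORY_BYTES / ROUTES
share_of_server = BIRD_MEMORY_BYTES / SERVER_MEMORY_BYTES

if __name__ == "__main__":
    print(f"{bytes_per_route:.0f} bytes per route")
    print(f"{share_of_server:.2%} of server memory")
```

That works out to roughly 320 bytes per route, or about a quarter of one percent of the memory on a 64 GB server.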

Of course, if the size of this worries you, then you can start performing intelligent tricks with BGP. That is one of the benefits of using BGP.  The industry has spent 20+ years developing a very rich, mature set of tools to manage routing table growth.  It’s a toolset that far exceeds anything available for the encapsulated environments.
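
Route aggregation is one well-known example of such a tool (the text above does not say which techniques a given deployment would use, so treat this as an assumption). The sketch below uses Python’s standard ipaddress module to show how contiguous per-workload /32 routes collapse into a single covering prefix; the addresses and the aggregate helper are hypothetical.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse a list of prefixes into the smallest equivalent set."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# Hypothetical example: one /32 host route for every address in 10.0.1.0/24.
host_routes = [f"10.0.1.{i}/32" for i in range(256)]

if __name__ == "__main__":
    print(len(host_routes), "routes ->", aggregate(host_routes))
```

Routes that are not contiguous are left untouched, so aggregation only saves table space where the address plan allows it.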

Data Center Only

Greg makes a statement that Calico can only run in a Linux VM environment:

The Calico solution relies on Linux host running all VMs.

This isn’t quite correct.  In the current code, we support VMs as “workloads” in the Calico environment, as well as multiple flavors of containers.  However, the design also supports bare metal services: all you have to do is enable another interface type in the code (such as a “dummy” interface in Linux).  We are actively discussing that model with more than one potential user.  The design has been there from the start.

It should also be noted that Calico can in principle use any host OS that has packet forwarding and packet filtering function. Code changes will obviously be required, but the general principle transfers perfectly.  The only requirement is for the target environment to provide an API that lets the Calico agent program the forwarding tables and ACL mechanism.  We are actually in discussions with folks who are interested in porting Calico to things other than actual Linux compute servers, but we would have been insane to target anything other than Linux as our first platform.

On connectivity beyond the data center, Greg says:

The documentation discusses gateways and access to the main network but I could not easily establish how connectivity to external networks is managed. A data centre require seamless integration to WAN, Wireless or Internet/DMZ networks and must be seamless.

Of course a data center requires access to the network beyond the data center horizon, and, because we use standard IP/BGP networking, we are actually better at this than the encapsulation-based approaches.

If you use encapsulation to provide your virtual networking, you need a device that can on-ramp and off-ramp the encapsulation when the traffic enters and leaves your DC. This is a nice bottleneck for network traffic. To get Calico traffic into and out of a DC, all you need to do is announce the route to your border routers, exactly as you were already doing (only without the NAT).  Because we use standard IP/BGP techniques, Calico nodes can simply peer with your route reflectors and/or border routers; no on-ramp/off-ramp or protocol conversion is required.

Our documentation covers some of this, but generally speaking you can divide the data center IPs into two groups: those in address space private to the DC, and those in public address space that the DC owns. If the workload has a private address, the data center gateway will need to NAT the traffic, as all IP gateways already do. For the others, the data center gateway simply needs to advertise the IP as being reachable inside the DC. All traffic will continue to flow normally.
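
The two cases above can be sketched as a simple classification. This Python sketch assumes that “private to the DC” means the RFC 1918 ranges, which is an assumption on our part here; the gateway_action function and the sample addresses are purely illustrative.

```python
import ipaddress

# Assumption: "private to the DC" is taken to mean the RFC 1918 ranges.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def gateway_action(workload_ip):
    """Return what the DC gateway must do for traffic from this workload."""
    addr = ipaddress.ip_address(workload_ip)
    if any(addr in net for net in RFC1918):
        return "nat"        # private address: translate at the gateway
    return "advertise"      # public address: announce it as reachable in the DC


if __name__ == "__main__":
    for ip in ("10.0.0.10", "192.168.5.20", "203.0.113.7"):
        print(ip, "->", gateway_action(ip))
```

Either way the decision is made once, at the gateway, and the workloads themselves just send and receive ordinary IP packets.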

There is no magic here in Calico; it’s just IP routing. All network operators are going to be familiar with how this works.

BGP

It is safe to say that, based on his post, Greg is fairly opinionated about BGP. If you distill his comments, his key concern is that “BGP is complicated”.

For those who have never actually operated a BGP network before, this is a commonly held misconception.  BGP the protocol, and the basic operation and configuration of the protocol, is actually easier than an IGP such as OSPF or IS-IS (and certainly easier than EIGRP, for anyone still stuck in the dark ages).  However, BGP has lots of knobs and lots of policy capabilities (think of them as BGP firewall behaviors).  They are there for folks who need them (say your ISP or some big backbone operator), but they are not necessary for a normal “stub” BGP network (which is what Calico is). In Calico we use BGP in an extremely straightforward way. Out of the box we don’t do anything exciting with BGP policy, which is the source of almost all of BGP’s complexity. This means that Calico’s use of BGP is extremely easy to understand and debug.

As an example of how simple our use of BGP is, you can find our BIRD configuration template here. This template contains all of the configuration needed for BIRD in a Calico network, and it’s 54 lines long, of which 22 lines are comments and whitespace. It is hard to argue that BIRD is “hard to configure” when looking at that file.

Additionally, debugging BIRD is really easy too, because we don’t do anything clever. BIRD’s CLI debugging tool is perfect for our purposes, because almost all you need to see from BIRD is the logs (that it emits to syslog) and the routing table (which is one CLI command away).

Docker and VM Networking

Greg is very unimpressed with how we stack up against our competitors:

Docker has introduced native networking support. It works with lots of existing products. Its even an enabler for Project Calico but mostly people will be using Cisco / VMware/Nuage for running their data centers. There are dozens of SDN solutions that offer this feature, Calico is just another one with relatively limited features at low cost.

Let’s get the facts out of the way first:

  1. Yes, Docker has native networking support.
  2. Yes, it has a plugin architecture, so other SDN providers can use it.
  3. Yes, other SDN providers do use it.

There is another assertion here, and that is “well, everyone is already using Cisco/VMware/Nuage, and will continue doing so.”  That may be the view from where you are sitting, and it may be the one that those vendors would like to market, but it’s not the reality on the ground, as we see it.  Yes, those vendors (and others) have deployments (and will continue to do so; this is a big tent).  However, it is fairly safe to say that the folks who are really at scale are not using any of those vendors (go ask the tier-1 OTT folks who they use, for example).  Also, enterprises that are migrating from a virtualized data center to a “click-to-compute” or private cloud offering are broadly looking at what that environment may be.  Will it be VMware, OpenStack, or one of the container-based models?  Will it be overlay or native networking?  Will it be SAN or object store?  Those questions are very much in the air.  If someone asserts that they “know” what the industry will look like, all I have to say is “you’re a better man than I am, Gunga Din.”

With that out of the way, the rest of this paragraph is two statements that boil down to this: “Calico is just an SDN with fewer features and low cost”.

The difference between Calico and Cisco/VMware/Nuage is not just a feature list. The real difference is ideological. Those providers believe that more features are always better, and that all users should bear the complexity and scale costs of the features that only a few users need. Calico strongly disagrees. Calico believes that complexity costs should be borne by only those users who need the complexity.

We, like Alan Kay, believe that simple things should be simple, and complex things should be possible. Most SDN solutions make complex things possible at the cost of making simple things complex: we reject that approach.

Christopher is the original architect of Project Calico and one of the project's evangelists. In his day job, he's the director of solutions architecture at Metaswitch Networks. Prior to Calico/Metaswitch, he designed and ran some bio-informatics OpenStack clusters, did some SDN architecture work at Big Switch Networks, ran architecture at two large carriers (Telstra - AS1221, and Cable & Wireless/iMCI - AS3561) and was the IP CTO for Alcatel in Asia. He's also run networks in Antarctica (hint, bend radius becomes REALLY important at -50C), and been foolish enough to do a stint as a wg co-chair in the IETF. Occasionally you can have the (mis-)fortune of hearing him speak at conferences and the like.