When you view a scale-out network through a 1990’s enterprise lens….

Lately, there has been a lot of discussion about Project Calico.  Most of that discussion gets what it is we are doing, by bringing basic Internet architectural approaches into the data center to scale and simplify the network plumbing.  However, when some look at the project, they are doing so through what I would term a “classical enterprise lens.”  That occasionally leads to a misunderstanding of the Calico approach, or the attempt to graft legacy enterprise networking models onto Calico.  The problem is that Calico comes from a very different heritage, and therefore, analyzing it against yesterday’s requirements and models may not always work.

These misunderstandings tend to focus around three key points and we will try and address them in the rest of this post.  However, there is another, more basic philosophical difference between an at-scale architectural approach, vs. a classical enterprise approach.

Some quick thoughts about the classical enterprise approach.  When the enterprise data center model was constructed, systems were long-lived and fairly self-contained.  They also had a lot of per-application specific requirements.  This led to very customized infrastructure designs, or design templates that covered as many requirements as possible.  The tradeoff that was made was increased complexity and fragility of the infrastructure.  Since the demand was generally static, that was, potentially, a reasonable compromise.

With the advent of scale-out, or cloud-centric architectures, most of the design models we use today for infrastructure acknowledge that the real infrastructure requirements are more uniform (same protocols, similar storage models, similar compute models) but are much more dynamic and at much higher scale.  This has led to the “pets vs. cattle” analogy that is used so often today.  Unfortunately, the network is still being treated as pets.  The “accepted” approach is to try and make those same fragile, static design patterns that worked before, work in the new, more scalable, less variable environment.  Adding scale to a potentially fragile, static model is almost certain to introduce new “challenges” in the networking space.  In Project Calico, we have elected to take a more “cattle” approach to the network, borrowing the drivers from the rest of the cloud (simple design patterns, at scale, and highly dynamic) and marrying them to the one network model that we know to scale, the design of the public Internet.  We believe that this is a viable path forward for cloud/scale-out environments, and one that discards unnecessary complexity, in return for dynamism and scale.

The real takeaway message is that Calico does scale, and because of its architecture, does so better than an overlay approach.  But more importantly, it is VERY efficient for the general use case and ONLY increases the complexity for the workloads that actually need it.  Architecting for the 100% is relatively easy, assuming you don’t care about the complexity that will make operating it devilishly difficult.  The trick is to architect for the right subset that gets you the biggest “bang for the buck” without making the corner cases impossible to meet.  We believe we are close to that goal with Project Calico.

Most of the comments we hear regarding Calico involve three main areas.  We’ll try to address those areas now.

L3 vs. L2 and overlays vs. native

Some of the comments that we have heard in recent weeks include “Calico is going back to L2 for scale reasons” or “Calico uses microsegmentation” (which is a very L2 way of looking at an L3 routed infrastructure).  These, and similar thoughts are conflating two disjunct concepts:

  1. What is the forwarding quanta on the network?  On what bit of packet information is the forwarding system based?
  2. What is the interconnect fabric that stitches the forwarding nodes in the network together?

Let’s take these in order.

What is the majority forwarding quanta on the network?

For the vast majority of traffic in most scale-out or cloud infrastructures today, the packet that the application is generating is an IP packet.  I would challenge anyone to find a non-IP (or IP related, like ICMP) packet in their network today.  The days of IPX, NetBEUI, EtherTalk, Banyan Vines, ATM, and DECNet are long gone (DECNet may still live on in some deep dark corner of ESNet, I guess).  When people say that we need L2 connectivity, that should really be rephrased as “We need private IP networks” or “I can’t be bothered with changing the design blueprint for this application from the 1990’s”.  Well, the 1990’s have called, and they want their physical choke-point, three-tier wedding cake network back.

Since IP is the quanta that we are using on the network, it only makes sense to use that as our forwarding model.  The reason that the IETF did the L2 over L3 encapsulation work (PWE3, L2VPN, etc.) was exactly that: the world is dominated by L3, so why not have that be the base of the network, and encapsulate the outlying traffic (legacy L2)?  What the L2 over L3 folks have forgotten is that this was supposed to be for the corner cases.  Does it really make sense to encapsulate IP over Ethernet over VXLAN over IP over Ethernet?  Is that efficient?  Is that any easier to troubleshoot?  If we really thought that Ethernet was the correct forwarding quanta, we should just build huge L2 networks.  Oh, right, we’ve tried that, and it didn’t work so well, did it?
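To put a rough number on that stack of headers, here is a back-of-the-envelope sketch.  It assumes untagged Ethernet, an IPv4 outer header without options, and VXLAN carried over UDP; the function names are just for illustration.

```python
# Header sizes in bytes (assumptions: untagged Ethernet, IPv4 outer header, no options).
ETHERNET = 14
IPV4 = 20
UDP = 8      # VXLAN is carried over UDP
VXLAN = 8


def extra_bytes_per_packet():
    """Bytes added to each tenant IP packet compared with sending it natively."""
    # Native:       [Ethernet][tenant IP packet]
    # Encapsulated: [Ethernet][outer IPv4][UDP][VXLAN][inner Ethernet][tenant IP packet]
    return IPV4 + UDP + VXLAN + ETHERNET


def tenant_mtu(underlay_ip_mtu=1500):
    """Largest tenant IP packet that fits without fragmenting the outer packet."""
    return underlay_ip_mtu - extra_bytes_per_packet()


print(extra_bytes_per_packet())  # 50
print(tenant_mtu())              # 1450
```

Fifty bytes per packet is not the end of the world, but every one of those headers is also one more thing to decode when you are staring at a packet capture at 3 AM.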

One thing that falls out from using IP as the forwarding quanta is that the L2 concept of segmentation loses a lot of its meaning.  In an Ethernet network, two nodes are either adjacent on a given segment, or they aren’t.  If they are, they can forward to each other; if not, they can’t.  IP really doesn’t have a concept of segments (or even subnets, that’s a way of mapping an L2 segment to an IP address space).  In a pure IP network, routers forward to what is called the “longest prefix match.”  Basically IP addresses can be grouped on bit boundaries as prefixes (e.g. 192.0.2.0/24, 198.51.100.16/30, and 2001:db8::/128).  Those have no relation to the underlying physical topology other than to say all the addresses that match that pattern share a common “route” from that router.  So, if a router has a route that says 192.0.2.0/24 is down interface 1, and that 192.0.2.26/32 is down interface 2, then all traffic destined for anything in 192.0.2.0/24 will go down interface 1, unless it is destined for 192.0.2.26, which would go down interface 2.  This allows for all sorts of capabilities that are just not available in Ethernet networks, and makes the whole concept of “microsegmentation” really meaningless in an IP network.
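If you want to see the two-route example in action, here is a minimal sketch using Python’s standard ipaddress module.  The function name and the route table layout are made up for illustration; a real router does this lookup in a trie or in hardware, not with a linear scan.

```python
import ipaddress


def longest_prefix_match(routes, destination):
    """Return the (prefix, interface) pair of the most specific matching route."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (prefix, interface)
        for prefix, interface in routes
        if prefix.version == addr.version and addr in prefix
    ]
    if not candidates:
        return None  # no route to destination
    # The route with the longest prefix length is the most specific one.
    return max(candidates, key=lambda item: item[0].prefixlen)


# The two routes from the example in the text.
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), "interface 1"),
    (ipaddress.ip_network("192.0.2.26/32"), "interface 2"),
]

print(longest_prefix_match(ROUTES, "192.0.2.7")[1])   # interface 1
print(longest_prefix_match(ROUTES, "192.0.2.26")[1])  # interface 2
```

Note that nothing in the lookup knows or cares which “segment” either address lives on.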

Calico’s approach is dependent on some assumptions.  All (or almost all) traffic is IP, and that, going forward, IP addresses are not hard-coded, but that some form of service discovery is used (DNS, etc.).  Those two design models have been well accepted in the industry for at least 15 years now, so maybe we can let go of a 1990’s networking model.  Based on that model, the Calico team does not believe that we need to encapsulate the native IP traffic in some virtual L2 layer, just to re-encapsulate yet another network layer (VXLAN, NVGRE, etc) and finally encapsulate that whole mess in another IP packet.  Most of our conversations with people evaluating Project Calico, or discussions at conferences and meetups, etc. tend to bear that belief out.  Therefore, we don’t, and will not, use encapsulation as the primary transport mechanism in Calico, period.

How do you interconnect the nodes (internal fabric)?

The second point, made by some folks who may not fully understand how routed networks work, is that “Calico is going back to Ethernet for scale reasons.”  This is usually in relation to some documents on Project Calico’s web site that discuss physical, or fabric topology options.  Those documents can be found here and here.

In the annals of Internet architecture lore, you can find long-running religious battles about how a backbone operator should interconnect their routers.  Some folks used switching (first ATM, then either Ethernet or MPLS), others used routing (PPP over SDH).  The reason was not that some were more stupid than others, but that different operators had different requirements or constraints.  The beauty of an IP network is that you have an almost endless choice of how to interconnect your routers.  You can use direct connections to other routers (the edge-router, core-router concept) which, in a data center might look surprisingly like an L3 Clos network.  You may use switching, such as Ethernet or MPLS, the former most commonly is represented in the data center by an L2 Clos, and the latter by a certain large service provider based out of Redmond.  In fact, you could even use carrier pigeons (in fact, if anyone implements IP over Avian Carrier for Calico, there will be an awesome reward from the Calico team).

By using an IP forwarding design, and turning the compute hosts/servers/slaves into routers, we allow the infrastructure architect to make choices about how to interconnect those servers and isolate those decisions from what the applications see.  In short, we allow the infrastructure to be decoupled from the tenant applications.

Now just as in any engineering design, there are tradeoffs, and it is up to the infrastructure designers to weigh those tradeoffs and decide how to interconnect their Calico routers.  (To date most Calico users with large scale deployments are choosing L3 for their interconnect fabric using Trident II based ToRs. Without route aggregation this gives them up to 128k IPv4 workloads.  For cases where that’s not enough route aggregation coupled with longest prefix match gives even higher scale while still maintaining IP mobility.)
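As a sketch of why aggregation and longest prefix match play well together, assume (purely for illustration) that each compute host is handed its own /26 block of workload addresses.  The fabric then only needs one route per host, plus a /32 for any workload that has wandered off to a different host.  The addresses below are documentation prefixes, not anything Calico allocates.

```python
import ipaddress

# Hypothetical block assigned to a single compute host.
host_block = ipaddress.ip_network("192.0.2.0/26")

# Without aggregation: one /32 route per workload address in the block.
workload_routes = [ipaddress.ip_network(f"{addr}/32") for addr in host_block]

# With aggregation: the same addresses collapse into a single route.
aggregated = list(ipaddress.collapse_addresses(workload_routes))

print(len(workload_routes))  # 64
print(aggregated)            # [IPv4Network('192.0.2.0/26')]

# A workload that migrates to another host keeps its address.  Its new host
# advertises a more specific /32, which wins over the /26 aggregate.
moved = ipaddress.ip_network("192.0.2.9/32")
print(moved.subnet_of(host_block), moved.prefixlen > host_block.prefixlen)  # True True
```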

We have some thoughts on different interconnect approaches documented in the docs section of our website (as noted above), but just because we say that there are L2 and L3 ways of interconnecting the Calico nodes, and that those decisions may have an impact on route scale, does not mean that Calico is “going back to Ethernet.”  In all cases we forward on IP packets, no matter what architecture is used to interconnect the Calico routers.

The ‘private’ network issue

Another issue that we hear from time to time is that 1990’s networking hasn’t disappeared yet, and some “critical” application has hard-coded IP addresses, or that some company just bought another company, and both decided to use 10.1.1.1 for some key application, and it can’t be changed.  We acknowledge that the world is not a clean place, and that these things will happen, so how does Calico handle this?  First let’s look at how the rest of the world does it.

Encapsulate everything

Network vendors have never met an encapsulation they didn’t like… and Network operations staff have never met one that they do….  However, even though the operation of an overlay, or encapsulation network is harder, the answer from most quarters for this problem is to encapsulate EVERYTHING.  That doesn’t really make sense if you posit that the majority (or vast majority) of traffic doesn’t come from, or is destined to, overlapped IPv4 nodes.  If that’s not the case, you have MUCH bigger problems in your infrastructure (called 1:1 NAT and DNS ALG, NAT chains, and much pain).  If most of the traffic is well-behaved, why penalize it for the miscreants?  Why make operations deal with everything being encaped for the minority of traffic that doesn’t fit the pattern?

NAT is simpler?

The other issue here is NAT.  If I’m encapsulating everything, I need to NAT at the end of service.  There are some issues with NAT.  One is that it is very state heavy.  In almost all implementations of NAT, I must maintain a state table for EACH session or flow that traverses that NAT appliance.  If that appliance goes down, all those sessions will be interrupted (and lost).  If that’s unacceptable, I need to replicate that state between multiple NAT nodes in packet (real) time.  For a large network, this becomes a substantial issue.  Furthermore, if I am using NAT, I will probably also need to tell lies in DNS (such that the same domain name resolves to different addresses on each side of the NAT service).  If I have multiple overlapping spaces, I have an n-way mapping that I must maintain.  Trying to troubleshoot an infrastructure like this means that I need to understand what I should be seeing at the source and destination and make sure I am actually seeing that.  There’s a reason why provisioning an application in a NAT/DNS-ALG cloud almost never works the first (or second, or third) time.  There is nothing simple about NAT.

There’s a tool for that (464-XLAT)

So, in Calico, we came up with a different way of handling this problem.  We use an IETF-standardized IPv6 transition mechanism called 464-XLAT.  Yes, it does mean that your underlying infrastructure requires IPv6 support, but almost any network equipment bought in the last 10 years supports IPv6 (remember, the 1990’s are calling for their network).  Furthermore, since we are officially out of IPv4 addresses almost everywhere in the world, you really should be enabling IPv6 in your scale-out cloud (ARIN in North America just rejected its first IPv4 request due to lack of available addresses).

Now that you’re done hyperventilating about the IPv6 requirement, let’s look at what this allows us to do.  IPv6’s address space is huge in comparison to IPv4’s.  It actually allows you to fully map the entire IPv4 address space into the space reserved for a single subnet two million times, with space left over.  In Calico, we leverage that to give each “instance” of an overlapped IPv4 “tenant” its own IPv4 space, all two million addresses.  Each of those instances has a different IPv6 prefix prepended to the IPv4 address for both the source and destination of the packet.  We then translate the IPv4 packet to an IPv6 packet and put it on the wire.  At the receiving end, the process is reversed, and the applications have no idea that they were re-mapped into IPv6 and back.

Because we can encode all the IPv4 addresses into unique IPv6 addresses, with room to spare, we can do this algorithmically, rather than by recording state for each packet.  Let me repeat, in Calico you can support all the overlapping IPv4 you want, statelessly.  Remapping processes can die, and when they come back, traffic just keeps flowing.  No state, no state replication HA, it’s just a static addressing transformation.  A further benefit, is that once you know how the “instances” are encoded in the IPv6 address, operations can look at any packet on the wire and tell you exactly what the original source and destination IPv4 addresses were, and to what “instance” those addresses belong, without looking in NAT state tables on servers, without looking at tunnel mapping tables, just by looking at the addresses.
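To make “algorithmically, rather than by recording state” concrete, here is a minimal sketch of that kind of mapping.  The bit layout (a /64 fabric prefix, then an instance ID, then the IPv4 address in the low 32 bits), the prefix itself, and the function names are assumptions chosen for illustration, not Calico’s actual encoding.

```python
import ipaddress

# Hypothetical fabric prefix reserved for remapped IPv4 traffic.
FABRIC_PREFIX = ipaddress.ip_network("2001:db8:0:1::/64")


def to_ipv6(instance_id, ipv4):
    """Embed (instance, IPv4 address) in an IPv6 address.  Pure arithmetic, no lookup table."""
    if not 0 <= instance_id < 2**32:
        raise ValueError("instance_id does not fit this illustrative layout")
    v4 = int(ipaddress.IPv4Address(ipv4))
    base = int(FABRIC_PREFIX.network_address)
    return ipaddress.IPv6Address(base | (instance_id << 32) | v4)


def from_ipv6(ipv6):
    """Recover (instance, IPv4 address) just by looking at the IPv6 address."""
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in FABRIC_PREFIX:
        raise ValueError("not a remapped address")
    low64 = int(addr) & (2**64 - 1)
    return low64 >> 32, ipaddress.IPv4Address(low64 & 0xFFFFFFFF)


# Two tenants that both insist on using 10.1.1.1 end up with distinct addresses.
a = to_ipv6(1, "10.1.1.1")
b = to_ipv6(2, "10.1.1.1")
print(a != b)        # True
print(from_ipv6(a))  # (1, IPv4Address('10.1.1.1'))
```

Because both directions are plain functions of the address, a remapping process that restarts has nothing to relearn, and anyone with a packet capture can do the reverse mapping by hand.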

So, while (or because) Calico doesn’t treat the overlapped addresses as native traffic, it is actually easier to operate, maintain, and trouble-shoot the Calico approach over the “encapsulate all things” model and the NAT that comes along with it.

For those of you who are going to try and catch us out on “but what if I need to talk to a service that isn’t in my instance’s address space?”, we have an answer for that as well.

If the service is provided by the fabric, we strongly encourage that those services be offered using IPv6.  If so, then a DC-SIIT model can be used, with the IPv4 node having no knowledge that it is talking to an IPv6 service.  If that is not possible, then stateful 464-XLAT can be used to access an IPv4 destination outside of the overlapped IPv4 instance.

De-conflating policy and connectivity

The last area of confusion we see is another 1990’s throwback.  In those dim, dark ages of networking, the only way to reasonably enforce policy on a large network was via single points of control, otherwise known as firewalls (or routers acting as a firewall).  This led to what is now commonly understood as the “three-tier” model of application deployment.  Each tier (front-end, application, database) was isolated from each other tier by a physical firewall port, and all the nodes within a given tier were connected to the same firewall port.  This made it easy to audit (see what devices are plugged in where), and easy to administer (install the rules in the firewall).  At least, it was easy when we were talking about physical nodes.

Furthermore, in this model, we conflate policy and connectivity.  Connectivity is provided by the policy control points.  However, these are different concepts.  One is “how do I get there” the other is “am I allowed to get there.”  They are very different questions, that should not be conflated.  Unfortunately, in this model, they are.

The problem is that, for many people, this is still what they think is industry best practice.  Sometimes this is referred to as “isolation” or “service insertion.”  Let me remind you that we aren’t talking about physical servers in physical racks, with physical firewalls anymore (go answer the call from the 1990’s, I’ll wait…).

Centralized SPOFs (I mean firewalls) are easier?

If we just port this model into the scale-out age, we have an interesting model.  If we take a classical three-tier model, then we have, say, a large number of application servers scattered all over the data center in virtual containers, which may be moving around as loads are shifted.  We also have a large number of web front-ends that are similarly configured.  In fact, some of those web front-ends may be on the same physical servers as some of the application nodes.  However, we have this virtual firewall container, that is, most certainly, somewhere else in the fabric (on some other host).  Now, when an application server needs to talk to a web front-end (that it may be adjacent to), it must wrap that IP packet up in an Ethernet packet, which then gets wrapped in a VXLAN packet, which is wrapped in an IP packet, and then put on the data center Ethernet network.  It is then shipped to the firewall where the Ethernet is stripped off, the outer IP and VXLAN are similarly stripped away, as is the inner Ethernet, leaving the IP packet.  The firewall examines the packet, decides to allow it (or not), and then re-wraps it all again, and sends it back to (potentially the same) host hosting the web front-end destination, where the packet is unwrapped yet again, and the IP packet is consumed by the web front-end application.  This doesn’t strike me as particularly efficient.

The next problem is that you will notice I said “the firewall.”  Yes, ladies and gentlemen, this is a single node offering that service, and therefore, is a Single Point of Failure (SPOF).  You CAN have multiples, but then each firewall needs to update every other firewall with all of its per-packet/flow state.  Firewall HA is an interesting problem, and beyond this article, but while it can be solved, it, in itself, is complex, and inherently fragile, and when it goes wrong, it really goes wrong.  Operations teams love it…

So, you’ve built a massively scalable, massively resilient three-tier app, and run all of the internal traffic through either a SPOF or a fragile HA cluster.  I hope I don’t need to point out the obvious.  I would also challenge you to find this architecture at any of the large web scale folks today, for good reason.

A side note here is that the three tier architecture may even be on its way out.  It worked when applications were silos, and applications didn’t need to cooperate (i.e. no one was interested in the data of a given application, other than the application itself).  Today application stacks are much more interconnected (more ‘east-west’ less ‘north-south’ in orientation).  In that model, the three tier approach breaks down and actively hampers development.  Maybe it’s time to throw that model back to the 90’s as well.

Scale out is (or should be) a horizontal, edge based activity

So, great, we’ve pointed out all the sins of the SPOF-based three tier model, what does Calico suggest instead?

It’s important to keep in mind that the key thing that drove the three tier model was that it was very painful to distribute security rules to lots of endpoints.  Outside of some small number of companies, the automation of network configuration was e-mail to the person that was going to key in the configuration.  So, the firewall vendors came along and offered a single place to (mis-)configure your security policy.

Since that time, we’ve come up with LOTS of ways of automating all kinds of configuration in scale-out environments.  If we hadn’t, we wouldn’t be having this conversation, as the whole cloud/web-scale model would be unsustainable.  We need ways of automating our cattle, and we have them (e.g. Ansible, Puppet, Chef, etcd/confd, and a myriad of others).  We can (and do) push configurations to 1000’s of endpoints at one time.  In Calico, we use our agent on each server to program routes and policies, and use BGP and etcd to push that data around to all the other Calico nodes.  We use those same mechanisms to enforce policy in Calico.

In Calico, we install and manage policy at the first and last “hop” in the Calico network.  This is arguably a more secure model than what is available via a centralized firewall, as in Calico we know what we are talking to (as it is directly connected, and can’t spoof as someone else) whereas in the central model, we are removed from the actual end-point we are trying to apply policy to.  It’s a bit like letting someone in the door because you’ve seen them through the window, rather than having someone tell you they are the delivery man via an intercom.  We also can simplify the rules as each Calico node’s rule set only has to incorporate the policies for the specific end-points attached to that Calico node.  Furthermore, we don’t have the centralized SPOF problem, as the rules are only for local nodes.  An outage of the node will only affect the end-points hosted on it (which will also be down from the outage) rather than other, remote end-points as in the centralized firewall model.

In Calico, just as in many tools in the scale-out world, the policies are managed in one place (either directly in our etcd datastore, or in the orchestrator that we listen to, such as OpenStack, Docker’s libnetwork, etc).  We then distribute those rules, as appropriate to all the nodes that need them.  We adjust the rules and where they are installed as the infrastructure grows, shrinks, or migrates.  We have the advantages of a single firewall (one place to manage the rules), but none of the disadvantages (SPOFs, state sharing, hairpinning of traffic, etc.)
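The “one place to manage, many places to enforce” idea can be sketched in a few lines.  This is a toy model under stated assumptions, not how the Calico agent is actually implemented: the endpoint names, the (source, destination, action) rule shape, and the rules_per_node function are all hypothetical.

```python
from collections import defaultdict


def rules_per_node(placement, policies):
    """Work out which rules each node must install.

    placement: dict mapping endpoint name -> node currently hosting it
    policies:  list of (source endpoint, destination endpoint, action) tuples
    """
    per_node = defaultdict(list)
    for rule in policies:
        src, dst, _action = rule
        # Enforce at the first and last hop: the nodes hosting the two end-points.
        for node in {placement[src], placement[dst]}:
            per_node[node].append(rule)
    return dict(per_node)


# Policy lives in one place...
policies = [("web-1", "app-1", "allow"), ("app-1", "db-1", "allow")]

# ...and is handed only to the nodes that host the end-points involved.
placement = {"web-1": "host-a", "app-1": "host-a", "db-1": "host-b"}
print(rules_per_node(placement, policies))

# When db-1 migrates, the same policy is simply recomputed for the new placement.
placement["db-1"] = "host-c"
print(rules_per_node(placement, policies))
```

Each node’s rule set stays small because it never sees rules for end-points it doesn’t host.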

Lastly, we decouple the reachability and the policy.  We use BGP to distribute the topology of the network, telling every node how to get to every end-point in case two end-points need to communicate.  We use policy to decide if those two nodes should communicate, and if so, how.  If policy changes and two end-points should now communicate, whereas before they shouldn’t have, all we have to do is update policy, the reachability information does not change.  If later, they should be denied the ability to communicate, the policy is updated again, and again, the reachability doesn’t have to change.  As applications mutate and get “mashed up” in this brave new world, the flexibility of this model is a substantial benefit.
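Here is a deliberately tiny sketch of that separation.  The two tables and the may_send function are illustrative stand-ins (in Calico, reachability is distributed by BGP and policy comes from the datastore or orchestrator); the point is only that they are different tables answering different questions.

```python
# "How do I get there?"  Reachability: end-point address -> next hop.
routes = {"192.0.2.5": "host-a", "192.0.2.9": "host-b"}

# "Am I allowed to get there?"  Policy: permitted (source, destination) pairs.
allowed = set()


def may_send(src, dst):
    """Traffic flows only if the destination is reachable AND policy permits it."""
    return dst in routes and (src, dst) in allowed


routes_before = dict(routes)

print(may_send("192.0.2.5", "192.0.2.9"))  # False: reachable, but not permitted

allowed.add(("192.0.2.5", "192.0.2.9"))    # policy change only
print(may_send("192.0.2.5", "192.0.2.9"))  # True

allowed.discard(("192.0.2.5", "192.0.2.9"))
print(may_send("192.0.2.5", "192.0.2.9"))  # False again

print(routes == routes_before)             # True: reachability never changed
```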

A quick comment about performance

We also hear, from time to time, concerns about Linux forwarding performance.  Let’s think about that for a minute.  If you have a server, and connect it to a 10 GE link, you expect to get 10 GE (or close to it) out of that link, and guess what, you do.  The days of Linux only being able to push 1-2 Gbps of throughput through an interface are long behind us (if not, why are people buying 10GE cards as standard builds today?).  In fact, I just recently had a friend tell me they were getting 97 Gbps of storage traffic between two nodes using 100 GE interfaces on standard servers.  Guess what, all this traffic is going through the Linux kernel today.  That 10GE server you have that can fill that port, it’s doing it through the Linux kernel.  I find it baffling that some folks are stuck on the thought that Linux performance is horrid, when the reverse is so obviously true, if you just understand that in the standard case, anything coming out of your Linux box goes through the SAME forwarding path that Calico uses.  There’s a reason that people are putting multiple 10GE interfaces on servers today, and looking at 25GE, 40GE, and 50GE interfaces, and it’s not just to throw more money at vendors, it’s because it works.

A final thought (congrats, you made it through a very long post)

We’re halfway through the second decade of the 21st century.  Maybe it’s time to stop evaluating networking through the lens of 1990’s enterprise architecture, let the 90’s bury their network in peace and quiet, and realize that the design patterns of the old will only perpetuate the pain that we are all trying desperately to disengage from.

 

Christopher is the original architect of Project Calico and one of the project's evangelists. In his day job, he's the director of solutions architecture at Metaswitch Networks. Prior to Calico/Metaswitch, he designed and ran some bio-informatics OpenStack clusters, did some SDN architecture work at Big Switch Networks, ran architecture at two large carriers (Telstra - AS1221, and Cable & Wireless/iMCI - AS3561) and was the IP CTO for Alcatel in Asia. He's also run networks in Antarctica (hint, bend radius becomes REALLY important at -50C), and been foolish enough to do a stint as a wg co-chair in the IETF. Occasionally you can have the (mis-)fortune of hearing him speak at conferences and the like.