As more enterprises adopt Kubernetes, we see that the need for an enterprise-grade service mesh has become increasingly important. This brings many new enterprise networking requirements into the service mesh concept that originally spun out of web-scale application teams. This new set of requirements has brought the Kubernetes networking/CNI layer and the service mesh layer closer together and created demand for a new layer delivering a combination of the two while providing the following:
Well integrated into public cloud and on-prem:
Similar to Kubernetes, service mesh was primarily focused on deployments with infrastructure backed in the public cloud. With enterprises starting to consider adoption, the need for equivalent functionality on-prem and the ability to connect cloud and on-prem together is rising quickly. Equally important is the multi-cloud and multi-cluster aspect to provide connectivity, security, and observability across clouds and premises decoupled from the underlying infrastructure provider.
Operate at the network level:
Control of the network layer is not only key to integrating with existing enterprise networking components on-prem, but it is also essential to fulfill compliance requirements in the cloud regarding segmentation, encryption, and visibility. This includes providing functionality such as network policies, egress gateways, transparent encryption, BGP, SRv6, and integration with traditional firewalls.
Operate at the application protocol level:
Understanding of application protocols such as HTTP and gRPC is required to meet the requirements of modern application development principles by implementing functionality such as traffic management, canary rollouts, tracing, and L7 authorization. This is achieved by implementing standards like Ingress and Gateway API.
As Isovalent, we have created the highly successful CNCF project Cilium which has become the de-facto standard for cloud native networking and security. Cilium is powering infrastructure at major enterprises such as Adobe, Bell Canada, Capital One, and IKEA, a majority of managed Kubernetes platforms including products from Google Cloud and AWS, and is the default CNI in numerous Kubernetes distributions. With the introduction of Cilium Service Mesh, we are extending the capabilities on the application protocol level.
What is Service Mesh?
With the introduction of distributed applications, additional visibility, connectivity, and security requirements have surfaced. Application components communicate over untrusted networks across cloud and premises boundaries, load balancing must understand application protocols, resiliency is becoming crucial, and security must evolve to a model where sender and receiver can authenticate each other’s identity. In the early days of distributed applications, these requirements were met by embedding the required logic directly into the applications. A service mesh extracts these features from the application and offers them as part of the infrastructure for all applications to use, so individual applications no longer need to be changed.
Since its early days, Cilium has been well aligned with the service mesh concept by operating at both the networking and the application protocol layer to provide connectivity, load-balancing, security, and observability. For all network processing including protocols such as IP, TCP, and UDP, Cilium uses eBPF as the highly efficient in-kernel datapath. Protocols at the application layer such as HTTP, Kafka, gRPC, and DNS are parsed using a proxy such as Envoy. Lastly, for service mesh use cases that go beyond the capabilities of Cilium, Cilium offers an Istio integration. It brings all Istio features to Cilium while allowing Cilium to enforce L7 policies via the Istio-managed sidecar. In parallel, Cilium automatically optimizes some aspects of Istio, such as shortening the network path to the injected sidecar and avoiding unencrypted data exposure between the application and the sidecar.
Kubernetes-native:
Our teams already know how to use Kubernetes. We want to consume service mesh features without learning many new concepts and to provide a Kubernetes-native user experience similar to how Cilium Cluster Mesh uses Kubernetes Services and NetworkPolicy to perform multi-cluster connectivity and security.
Reduce Complexity & Overhead:
The complexity and overhead impact of sidecars can be severe. Provide us with a simple datapath model that avoids overhead while supporting arbitrary networking protocols. Kelsey Hightower referred to this as “service mess”.
With this first release of Cilium Service Mesh, users now have the choice to run a service mesh with sidecars or without them. When to best use which model depends on various factors including overhead, resource management, failure domain, and security considerations. In fact, the trade-offs are quite similar to virtual machines and containers. VMs provide stricter isolation. Containers are lighter, able to share resources and offer fair distribution of the available resources. Because of this, containers typically increase deployment density, with the trade-off of additional security and resource management challenges. With Cilium Service Mesh, you have both options available in your platform and can even run a mix of the two.
Besides avoiding the sheer number of proxies that need to be run in a sidecar model, a significant advantage of a sidecar-free model is that we can avoid the requirement to run two proxies in between any connection. This is made possible by using Cilium at the network/node level to encrypt and authenticate, or by using the upcoming mTLS model which separates authentication from transport (more details in this blog: Next-Generation Mutual Authentication with Cilium Service Mesh).
Reducing the number of proxies in the network path and choosing the type of Envoy filter both have a significant impact on performance. The above benchmark illustrates the latency cost of HTTP processing with a single Envoy proxy running the Cilium Envoy filter (brown) compared to a two-sidecar Envoy model running the Istio Envoy filter (blue). Yellow is the baseline latency with no proxy and no HTTP processing performed.
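To make the proxy-count argument concrete, here is a minimal back-of-the-envelope sketch. It assumes one sidecar proxy per pod in the sidecar model and one shared proxy per node in the sidecar-free model. The `Cluster` type, the function names, and the cluster size are made up for illustration and are not part of Cilium.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster shape used only for this illustration."""
    nodes: int
    pods_per_node: int


def proxies_running(cluster: Cluster, sidecar: bool) -> int:
    """Total proxy instances: one per pod with sidecars, one per node without
    (assumption: a single shared proxy per node in the sidecar-free model)."""
    if sidecar:
        return cluster.nodes * cluster.pods_per_node
    return cluster.nodes


def proxies_on_path(sidecar: bool) -> int:
    """Proxies a single connection traverses: the client-side and server-side
    sidecars in the sidecar model, a single proxy otherwise."""
    return 2 if sidecar else 1


cluster = Cluster(nodes=100, pods_per_node=30)  # made-up numbers
print(proxies_running(cluster, sidecar=True))   # 3000
print(proxies_running(cluster, sidecar=False))  # 100
print(proxies_on_path(sidecar=True), proxies_on_path(sidecar=False))  # 2 1
```

The numbers say nothing about latency on their own; they only show how the count of proxy instances and of proxy hops differs between the two models.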
eBPF-Native When Possible
Besides the option to remove sidecars, Cilium Service Mesh can perform a variety of service mesh features directly in eBPF to reduce the overhead even further. When possible, the processing is performed in eBPF at a fraction of the cost. If eBPF is not capable of processing the request, for example when connections need to be spliced, requests need to be rate-limited, or TLS termination is required, the handling falls back to Envoy running in either a sidecar or sidecar-free model. This gives the best of both worlds – eBPF processing when possible for increased performance and reduced latency, with the ability always to fall back to Envoy as needed.
The benchmark above measures the P95 latency of HTTP requests and responses. It compares an eBPF-based HTTP/2 parser (brown) and a sidecar approach (blue) against the baseline (yellow), which has no visibility enabled. The eBPF-based HTTP/2 parser is available in Isovalent Cilium Enterprise. Envoy was used in this example, but the choice of sidecar proxy does not matter much: the results were almost identical for other proxies that we tested, because the main cost stems from the injection of the proxy and the requirement to terminate connections and move data between upstream and downstream.
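The "eBPF when possible, Envoy when needed" rule described above can be sketched as a simple decision function. This is only an illustration, not how Cilium is implemented: the `Requirement` names and `choose_datapath` are hypothetical, and the proxy-only set contains just the three examples named in the text (connection splicing, rate limiting, TLS termination).

```python
from enum import Enum, auto
from typing import Iterable


class Requirement(Enum):
    """Hypothetical labels for what a request needs from the mesh."""
    L3_L4_FORWARDING = auto()     # IP/TCP/UDP processing, handled in eBPF
    CONNECTION_SPLICING = auto()
    RATE_LIMITING = auto()
    TLS_TERMINATION = auto()


# Only the three cases the text names as requiring the proxy fallback.
PROXY_ONLY = frozenset({
    Requirement.CONNECTION_SPLICING,
    Requirement.RATE_LIMITING,
    Requirement.TLS_TERMINATION,
})


def choose_datapath(requirements: Iterable[Requirement]) -> str:
    """Return "envoy" if any requirement needs the proxy, otherwise "ebpf"."""
    return "envoy" if PROXY_ONLY.intersection(requirements) else "ebpf"


print(choose_datapath([Requirement.L3_L4_FORWARDING]))  # ebpf
print(choose_datapath([Requirement.L3_L4_FORWARDING,
                       Requirement.TLS_TERMINATION]))   # envoy
```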
What can be done in eBPF? When is a Proxy needed?
The table below lists the most common service mesh features and whether they need to be routed through a proxy running in either sidecar or sidecar-free mode:
The second big user requirement is to reduce the complexity and learning curve when adopting service mesh. Kubernetes has been exceptionally good at providing different abstractions at different levels of complexity, and Cilium Service Mesh allows users to do the same. We are extending the number of supported service mesh control planes in addition to the existing Istio integration to bring the new sidecar-free datapath option to existing service mesh standards.
This first release of Cilium Service Mesh includes a fully compliant Kubernetes Ingress Controller, enabling application teams to use L7 load-balancing and traffic management capabilities via the standardized Kubernetes Ingress resource. Kubernetes Ingress load-balancing can be applied for traffic into the cluster, within, and across clusters. (See Getting Started with Kubernetes Ingress)
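As a rough illustration of what consuming the standard Ingress resource looks like, the sketch below builds a `networking.k8s.io/v1` Ingress manifest and prints it as JSON. The resource name, service name, and port are placeholders, and the `cilium` ingress class name is an assumption here; check the getting started guide for the value your installation expects.

```python
import json


def build_ingress(name: str, service: str, port: int,
                  path: str = "/", ingress_class: str = "cilium") -> dict:
    """Build a standard Kubernetes Ingress manifest as a plain dict.

    `ingress_class` defaults to "cilium", which is an assumption for this
    sketch rather than something stated in the text above.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "ingressClassName": ingress_class,
            "rules": [{
                "http": {
                    "paths": [{
                        "path": path,
                        "pathType": "Prefix",
                        "backend": {
                            "service": {"name": service,
                                        "port": {"number": port}},
                        },
                    }],
                },
            }],
        },
    }


# Placeholder names: "demo-ingress" and "demo-service" are hypothetical.
manifest = build_ingress("demo-ingress", "demo-service", 8080)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -` in a test cluster.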
Envoy Configuration CRD (New)
An exciting new Envoy Configuration CRD is available, making the entire Envoy proxy feature set available anywhere in the network. This enables users to write Envoy configuration directly and apply it anywhere in the network, enabling advanced use cases that are not even covered by service meshes such as Istio.
Gateway API (Work In Progress)
We are hard at work to support the Kubernetes Gateway API standard as the next supported control plane. It brings additional capabilities on top of Kubernetes Ingress and is likely a feasible option for many application and platform teams as it strikes a good balance between capability and complexity.
SPIFFE (Coming Next)
Finally, the SPIFFE integration is already on the way and will bring service-specific certificates and thus proxy-based mTLS support.
mTLS for Any Network Protocol
By splitting the authentication handshake from the payload transport, we can use TLS 1.3 as the handshake protocol while relying on IPsec or WireGuard as a better-performing, more transparent payload channel:
Connections don’t need to be terminated anymore:
A sidecar-based approach has to split every TCP connection into three segments in order to inject TLS, whereas the sidecar-free approach does not require terminating or manipulating connections.
No sidecars need to be injected:
No additional proxies need to be run; the authentication on behalf of the services can be performed by a single node agent. In the case of Cilium, this agent already exists and is aware of all required context. This simplifies management, improves the resource footprint, and improves scalability.
Support Non-TCP & Multicast:
While this model benefits from the great properties of TLS 1.3 such as the low-latency handshake, TLS does not limit what can be transported. UDP, ICMP, and any other protocol that can be carried by IP are supported.
Support existing Identity & Certificate Management:
Any mTLS-based authentication control plane or identity management system can be plugged in and used to provide certificates for services. This includes SPIFFE, Vault, SMI, Istio, etc.
Handshake caching & Re-authentication:
The handshake can be done once and cached, so communication between already authenticated service-to-service pairs happens without additional latency. Even better, services can be re-authenticated with each other on a regular basis.
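The handshake caching and periodic re-authentication described above can be sketched as a small cache keyed by service pair. This is a minimal illustration under stated assumptions, not Cilium's implementation: `AuthCache`, the injected `handshake` callable (standing in for the TLS 1.3 handshake), the TTL value, and the service names are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class AuthCache:
    """Caches successful handshakes per (source, destination) service pair."""

    def __init__(self, handshake: Callable[[str, str], bool],
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._handshake = handshake      # stand-in for the real mTLS handshake
        self._ttl = ttl_seconds          # how long an authentication stays valid
        self._clock = clock
        self._expiry: Dict[Tuple[str, str], float] = {}

    def is_authenticated(self, src: str, dst: str) -> bool:
        key = (src, dst)
        now = self._clock()
        expiry = self._expiry.get(key)
        if expiry is not None and now < expiry:
            return True                  # cached: no handshake, no extra latency
        # First contact or expired entry: (re-)authenticate the pair.
        if self._handshake(src, dst):
            self._expiry[key] = now + self._ttl
            return True
        self._expiry.pop(key, None)
        return False


# Demo with a fake handshake and a controllable clock.
calls = []

def fake_handshake(src: str, dst: str) -> bool:
    calls.append((src, dst))
    return True

now = [0.0]
cache = AuthCache(fake_handshake, ttl_seconds=60, clock=lambda: now[0])
cache.is_authenticated("frontend", "backend")  # performs the handshake
cache.is_authenticated("frontend", "backend")  # served from the cache
now[0] = 61.0
cache.is_authenticated("frontend", "backend")  # entry expired: re-authenticates
print(len(calls))  # 2
```

Since the payload travels over IPsec or WireGuard in this model, the cache only gates whether a pair is allowed to communicate; it never sits in the data path of an already authenticated pair.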
eCHO Episode 32 – Hands-On with Cilium Service Mesh