Service Mesh Fundamentals: Theoretical Insights into HashiCorp Consul Connect

Sandeep

What is a Service Mesh?

A service mesh is a low-latency infrastructure layer designed to manage high-volume communications between microservices. It is typically implemented as a set of network proxies, often referred to as sidecars, deployed alongside the application code. This infrastructure layer enhances the application by adding observability, security, and reliability features.

In a service mesh, there are two main components: the data plane and the control plane. The data plane consists of the proxies that handle request routing, encryption, observability, and additional security features such as mTLS. The control plane is responsible for dynamically configuring those proxies. For example, in the HashiCorp Consul Connect service mesh, Envoy proxies form the data plane, while Consul servers act as the control plane.

Why Use a Service Mesh?

A service mesh offers features in four key areas:

  1. Security
  2. Observability
  3. Reliability
  4. Traffic Control

The main value of a service mesh is its ability to provide these features consistently across all services and workloads without requiring any modifications to the service code.

Observability

Observability measures how well you can understand a system’s internal states from its external outputs. Although issues in production systems are inevitable, the goal is to ensure that development teams have the data they need to detect and diagnose problems. With the surge in microservices adoption, it is essential that a system is observable for effective debugging and diagnostics. However, configuring all services to emit metrics and other data consistently can be challenging. This is where a service mesh excels: because all requests are routed through its proxies, it can capture observability data centrally. The service mesh configures its proxies to emit metrics for all services in a consistent format, without modifying or redeploying the underlying services.

Observability has three pillars:

  1. Metrics
  2. Traces
  3. Logs

Metrics

Metrics are numeric representations of data measured over periodic intervals. They can reveal the number of requests a service is handling per second, resource consumption (CPU, memory, etc.), and can help identify issues before they cause outages or aid in root cause analysis if they do occur.

Metrics are useful from an observability standpoint for three reasons:

  1. They can diagnose why something has failed. For example, if an application is responding slowly, its metrics might show that an upstream dependency has suddenly started taking five seconds per request.
  2. They can alert on potential issues before they become outages. For instance, if you know that your application degrades when it receives more than 1000 requests per second (rps), you could alert when it’s receiving 750 rps.
  3. They can be used to build dashboards. Dashboards provide a quick glance to understand a service’s health.

The mesh sidecar proxy exposes a Prometheus metrics endpoint on port 9102, which is scraped by the Service Mesh Prometheus instance. The Service Mesh provides out-of-the-box dashboards covering standard metrics for all services, and these can be extended by the respective teams. Teams can also scrape the metrics into their own Prometheus if required. The Envoy proxy in the Service Mesh sidecar exposes hundreds of metrics; the full list is available in the Envoy documentation.
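
As a rough illustration, the sketch below uses the Consul Go API client (github.com/hashicorp/consul/api) to write a proxy-defaults config entry that asks every Envoy sidecar to expose its Prometheus endpoint on port 9102. It assumes a reachable local Consul agent; the bind address is only an example.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Create a Consul API client pointed at the local agent.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// proxy-defaults applies to every sidecar proxy in the mesh.
	// envoy_prometheus_bind_addr tells Envoy to expose its Prometheus
	// metrics endpoint on the given address (port 9102, as above).
	entry := &api.ProxyConfigEntry{
		Kind: api.ProxyDefaults,
		Name: "global", // proxy-defaults entries are always named "global"
		Config: map[string]interface{}{
			"envoy_prometheus_bind_addr": "0.0.0.0:9102",
		},
	}

	if _, _, err := client.ConfigEntries().Set(entry, nil); err != nil {
		log.Fatal(err)
	}
}
```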

Service Graph

Visualizing services via the Service Dependency Graph is one of the most powerful features of a Service Mesh. This graph is powered by the centralized and uniform metrics across the services, which are automatically scraped by the mesh. Service Mesh scrapes the metrics of all instances of all services dynamically by default.

The graph provides a live and holistic view of inter-service communications, along with request count, errors, and average response time.

Tracing

Tracing is a method for collecting telemetry data by recording all microservice calls made as a result of a single request. For example, an API request to Service A might require an upstream call to Service B, which in turn invokes Service C. In case of an error or slowness in such a workflow, without tracing, the only option is to look at the individual logs and metrics from each service to narrow down the issue.

How Tracing Works

The first step in recording a trace is generating a unique request ID (typically done by the API Gateway or first service), which identifies the request across all microservices involved. The next step is to generate a span ID. Each service in the trace records one or more spans, which are units of work performed as part of the request with a start and end time. Each service has at least one span to indicate that it serviced a request and may have multiple spans if performing multiple pieces of work. For example, if a service makes a database call and then processes the results, it might have three spans: one for the overall call to the service, one for the database call, and one for processing.

When a service makes a request to other upstream services, it includes the request ID and its span ID as HTTP headers. When those upstream services make their own calls to other services, they also include the request ID and their span ID. Each service then emits its spans to the tracing collector, which correlates the spans into a single trace using the request ID and span IDs.
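
To make the mechanics concrete, here is a minimal, hypothetical Go sketch of trace-header propagation. The header names follow the B3/Envoy convention (x-request-id, x-b3-traceid, x-b3-spanid), but the exact set depends on the tracing system in use, and the upstream URL is a placeholder.

```go
package main

import (
	"net/http"
)

// traceHeaders are the correlation headers propagated on every upstream call.
// These follow the B3/Envoy convention; other tracing systems use different
// headers (for example, the W3C traceparent header).
var traceHeaders = []string{"x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid"}

// callUpstream copies the trace headers from the incoming request onto the
// outgoing one, so the tracing collector can stitch the spans into one trace.
func callUpstream(in *http.Request, url string) (*http.Response, error) {
	out, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for _, h := range traceHeaders {
		if v := in.Header.Get(h); v != "" {
			out.Header.Set(h, v)
		}
	}
	return http.DefaultClient.Do(out)
}

func main() {
	http.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		// Service A handling a request: propagate the trace context to Service B.
		resp, err := callUpstream(r, "http://service-b.internal/items") // hypothetical upstream URL
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
	})
	http.ListenAndServe(":8080", nil)
}
```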

Logs

Logs are time-stamped records that provide software architects and developers with detailed information about the resources in use. Logs offer insights into what went wrong and what changed in the system’s behavior at that time. Logs contain structured or unstructured information (usually text) generated when the application is running. We usually examine logs when things go wrong, investigating them to analyze and debug the application’s code to identify issues or defects.

All the telemetry data Service Mesh collects about applications is emitted via metrics and traces. Service Mesh doesn’t add any features for logging.

Reliability

Failures are inevitable in distributed systems. Building reliable systems means minimizing failures where possible and handling them gracefully when they occur. Here are a few ways a system can reduce failures:

  1. Implement health checks so that traffic is only sent to healthy services.
  2. Retry failed requests to ensure they succeed despite transient issues.
  3. Implement timeouts to prevent a service from waiting indefinitely for a response.

Implementing these techniques in code can be time-consuming, error-prone, and difficult to do consistently across all services. With a service mesh, proxies can perform these techniques for any service—all you need to do is interact with the control plane. You can also adjust settings in real-time as service loads change.

Health Checks

Service Mesh supports both active and passive health checks.

Active Health Check

Active health checks perform an action at regular intervals to verify that the service is healthy, for example checking whether a port is open or a specific process is running. In most Spring Boot applications, the “/actuator/health” endpoint is used to check health status. If the check succeeds, the service is considered healthy; otherwise, it is marked as unhealthy.
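
As a sketch, a service can be registered with the local Consul agent together with an active HTTP check against its Spring Boot health endpoint, using the Consul Go API client. The service name, port, and intervals below are placeholders.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register the service with an active HTTP check: the agent calls
	// /actuator/health every 10s and marks the instance unhealthy on failure.
	reg := &api.AgentServiceRegistration{
		Name: "backend", // placeholder service name
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:                           "http://localhost:8080/actuator/health",
			Interval:                       "10s",
			Timeout:                        "2s",
			DeregisterCriticalServiceAfter: "1m",
		},
	}

	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```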

Passive Health Check

Passive health checks rely on requests going to a service as part of its regular workload. Callers record all requests to a service, and if a certain number fail, the service is marked as unhealthy. For example, if 9 out of 10 requests to the backend service result in an HTTP 503 error, the service is marked as unhealthy.

Retries

Even with health checks configured, there’s still a chance that a request may fail due to random network issues or a service becoming unhealthy. The easiest way to handle such failures is to retry the request. Service Mesh supports configuring sidecar proxies to automatically retry requests, so from the perspective of the calling service, it has only sent one request.

Service Mesh allows services to configure retries based on status codes and to change them dynamically via a GitOps pipeline.
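
In Consul Connect, retries of this kind are configured through a service-router config entry. The sketch below, written against the Consul Go API client, assumes a service named backend and purely illustrative retry settings.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Retry failed requests to "backend" up to 3 times when the upstream
	// returns a 503 or the connection itself fails.
	router := &api.ServiceRouterConfigEntry{
		Kind: api.ServiceRouter,
		Name: "backend", // illustrative service name
		Routes: []api.ServiceRoute{
			{
				Destination: &api.ServiceRouteDestination{
					NumRetries:            3,
					RetryOnConnectFailure: true,
					RetryOnStatusCodes:    []uint32{503},
				},
			},
		},
	}

	if _, _, err := client.ConfigEntries().Set(router, nil); err != nil {
		log.Fatal(err)
	}
}
```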

Timeouts

In a microservices architecture, there are many calls between services, and the calling service usually has to wait for a response before it can reply to its own caller. In these systems, timeouts are vital.

Timeouts solve this problem by failing the request after a defined period. Implementing timeouts in service code has the same issues as implementing telemetry and security in service code, so leveraging the service mesh for timeouts is more efficient.
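
A request timeout can be applied the same way, again via a service-router config entry. The sketch below assumes the same hypothetical backend service and a five-second limit chosen purely for illustration; in practice the timeout and the retry settings above can live on the same route destination.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Fail any request to "backend" that takes longer than 5 seconds,
	// instead of letting the caller wait indefinitely.
	router := &api.ServiceRouterConfigEntry{
		Kind: api.ServiceRouter,
		Name: "backend", // illustrative service name
		Routes: []api.ServiceRoute{
			{
				Destination: &api.ServiceRouteDestination{
					RequestTimeout: 5 * time.Second,
				},
			},
		},
	}

	if _, _, err := client.ConfigEntries().Set(router, nil); err != nil {
		log.Fatal(err)
	}
}
```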

Traffic Control

Traffic control involves managing where traffic between services is routed. It solves various problems, including:

  1. Implementing deployment strategies: Such as canary deployments, where a small amount of “canary” traffic is routed to the new version of a service to test it before a full rollout.
  2. Monolith to microservices migrations: Traffic previously routed to a monolith is seamlessly redirected to the new microservices.
  3. Multi-cluster failover: Traffic is routed to services in other healthy clusters if the local cluster is down.

Deployment Strategies

There are three primary deployment strategies supported in the Service Mesh:

  1. Rolling
  2. Blue/Green
  3. Canary

Rolling Deployments

In a rolling deployment, older instances are gradually replaced with newer versions. Traffic is continuously routed equally across all running instances. Rolling deployments are the default strategy in Service Mesh.

Blue/Green Deployments

In a traditional blue/green deployment, there are two identical production environments: one labeled green, the other blue. Only one environment serves traffic at any time, while the other is either on standby or spun down. During deployment, the new version is fully deployed to the inactive environment, and traffic is then swapped over to it.

Canary Deployments

A canary deployment is similar to a blue/green deployment, but traffic is not switched all at once. Instead, a small amount of traffic is initially routed to the new version to ensure it works as expected before a full rollout.

Consul Connect

HashiCorp Consul Connect uses service resolver, splitter, and router config entries to enable finer-grained control for deployment strategies and advanced routing.

Service Resolvers

Service resolver config entries are part of L7 Traffic Management. They target different service versions by dividing a service’s instances into subsets. To determine which instances belong to a specific subset, the service resolver filters instances using a match expression.

This can be used for various use cases, including:

  1. Routing traffic to a specific subset of a service (based on service metadata key/value). This can be used for implementing Blue/Green deployments.
  2. Exposing a service running in another datacenter as a local service.
  3. Enabling failover to another datacenter. For example, if services in us-east-1 are down, traffic can automatically be routed to us-west-2.
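
For the first use case, a service resolver can split a hypothetical backend service into blue and green subsets based on a service metadata key. The sketch below uses the Consul Go API client; the metadata key and subset names are assumptions.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Divide "backend" instances into blue/green subsets based on a
	// Service.Meta key, and send all traffic to the blue subset by default.
	resolver := &api.ServiceResolverConfigEntry{
		Kind:          api.ServiceResolver,
		Name:          "backend", // illustrative service name
		DefaultSubset: "blue",
		Subsets: map[string]api.ServiceResolverSubset{
			"blue":  {Filter: "Service.Meta.deployment == blue"},
			"green": {Filter: "Service.Meta.deployment == green"},
		},
	}

	if _, _, err := client.ConfigEntries().Set(resolver, nil); err != nil {
		log.Fatal(err)
	}
}
```

Flipping DefaultSubset from blue to green then switches all traffic over in one step, which is the Blue/Green pattern described above.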

Service Splitter

Service splitter controls how incoming Connect requests are split across different subsets of a single service (e.g., during staged canary rollouts) or across different services (e.g., during a v2 rewrite or other type of codebase migration).

Examples include:

  1. Splitting traffic between two subsets of the same service for implementing Canary deployments, where a small fraction of traffic is routed to the newer version.
  2. Splitting traffic between two subsets with extra headers added so clients can identify the version.
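
For example, a canary rollout for the same hypothetical backend service could keep 90% of traffic on the stable subset and send 10% to the canary subset defined by the resolver above; the weights are illustrative.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Canary rollout: keep 90% of traffic on the stable (blue) subset and
	// send 10% to the canary (green) subset defined in the service-resolver.
	splitter := &api.ServiceSplitterConfigEntry{
		Kind: api.ServiceSplitter,
		Name: "backend", // illustrative service name
		Splits: []api.ServiceSplit{
			{Weight: 90, ServiceSubset: "blue"},
			{Weight: 10, ServiceSubset: "green"},
		},
	}

	if _, _, err := client.ConfigEntries().Set(splitter, nil); err != nil {
		log.Fatal(err)
	}
}
```

Shifting the weights over successive config writes gradually moves traffic until the new subset receives 100%.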

Service Router

Service Router matches requests based on their attributes and routes certain requests to specific subsets. It can also be used to set retries and timeouts. Service routers are flexible and can match requests based on HTTP or gRPC path, method, headers, or query strings.

Examples include:

  1. Routing all requests with a specific HTTP header to a new subset, while other requests go to the older subset of the service.
  2. Routing HTTP requests with a special URL parameter or header to a canary subset.
  3. Routing HTTP requests with paths starting with /admin to a different service.
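
The sketch below combines two of these examples: requests carrying a specific header go to the canary (green) subset, and requests under /admin go to a separate admin service. The header name and service names are assumptions.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Route requests carrying an x-canary header to the green subset and
	// requests under /admin to a separate admin service; everything else
	// falls through to the default subset.
	router := &api.ServiceRouterConfigEntry{
		Kind: api.ServiceRouter,
		Name: "backend", // illustrative service name
		Routes: []api.ServiceRoute{
			{
				Match: &api.ServiceRouteMatch{HTTP: &api.ServiceRouteHTTPMatch{
					Header: []api.ServiceRouteHTTPMatchHeader{{Name: "x-canary", Exact: "true"}},
				}},
				Destination: &api.ServiceRouteDestination{ServiceSubset: "green"},
			},
			{
				Match: &api.ServiceRouteMatch{HTTP: &api.ServiceRouteHTTPMatch{
					PathPrefix: "/admin",
				}},
				Destination: &api.ServiceRouteDestination{Service: "admin"},
			},
		},
	}

	if _, _, err := client.ConfigEntries().Set(router, nil); err != nil {
		log.Fatal(err)
	}
}
```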

Security

One of the primary reasons companies deploy service meshes is to secure their networks. This typically involves encrypting traffic between all workloads and implementing authentication and authorization.

In a microservices architecture, solving this problem without a service mesh can be very difficult. Encrypting every request requires provisioning Transport Layer Security (TLS) certificates to each service securely and managing your own certificate signing infrastructure. Authenticating and authorizing every request means updating and maintaining authentication code in every service.

A service mesh simplifies this process by issuing certificates and configuring sidecar proxies to encrypt traffic and perform authorization—all without changing the underlying services.

A service mesh can provide:

  1. Encryption of traffic between services (mTLS)
  2. Enforcement of rules about which services can communicate and what kinds of requests are allowed (Intentions)

These features are part of implementing a security model known as a zero-trust network.

Zero Trust Networking

Traditional network security architecture followed the castle and moat model. In this model, services are deployed into an internal private network (the castle) not connected to the public internet, while a firewall (the moat) secures access. Load balancers are deployed outside the private network and allowed access through the firewall. It is assumed that everything running inside the internal network can be trusted, so there's no need for encryption, authentication, or authorization between internal services.

In a zero-trust network, you assume the internal network is compromised. Services do not implicitly trust requests simply because they come from inside the internal network. Instead, services implement encryption, authentication, and authorization for all requests.

Encryption & Authentication

Encryption prevents man-in-the-middle attacks because attackers can’t read the data sent between two services. Only the destination service can decrypt the data.

Service Mesh provides each service with an identity encoded as a TLS certificate. This certificate is used to establish and accept connections to and from other services. The identity is encoded in the TLS certificate in compliance with the SPIFFE X.509 Identity Document, enabling services to establish and accept connections with other SPIFFE-compliant systems.

The client service verifies the destination service certificate against the public CA bundle, similar to a typical HTTPS web browser connection. Additionally, the client provides its own client certificate to show its identity to the destination service. If the connection handshake succeeds, the connection is encrypted and authorized.

The destination service verifies the client certificate against the public CA bundle. Consul has a built-in CA to generate and distribute certificates, requiring no other dependencies, and ships with built-in support for Vault. The PKI system is pluggable and can be extended to support any system by adding additional CA providers.

Authorization

Authorization determines whether an authenticated entity is allowed to perform a certain action, such as making requests to another service or accessing a specific HTTP path.

Service Mesh implements authorization via its intentions system. Intentions are rules governing which services are allowed to communicate.

Every intention has a source and destination. For example, an intention might allow a specific service frontend (source) to connect to a specific service backend (destination). Alternatively, a wildcard (*) can be used as the source or destination, allowing, for example, an ingress gateway to connect to any service, or any service to connect to any other service.

Depending on the protocol in use by the destination service, intentions can be either at networking layer 4 (e.g., TCP) or application layer 7 (e.g., HTTP):

  1. Identity-based: All intentions may enforce access based on identities encoded within TLS certificates. This allows for coarse all-or-nothing access control between pairs of services. These work with services using any protocol, as they only require awareness of the TLS handshake that wraps the opaque TCP connection. These can also be thought of as L4 intentions.
  2. Application-aware: Some intentions may additionally enforce access based on L7 request attributes in addition to connection identity. These may only be defined for services with a protocol that is HTTP-based. These can also be thought of as L7 intentions.
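
As a minimal sketch of an identity-based (L4) intention, the config entry below allows only frontend to connect to backend, assuming intentions otherwise default to deny; the service names are placeholders.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// L4, identity-based intention: allow "frontend" to open connections to
	// "backend"; with a default-deny policy, everything else stays blocked.
	intentions := &api.ServiceIntentionsConfigEntry{
		Kind: api.ServiceIntentions,
		Name: "backend", // destination service
		Sources: []*api.SourceIntention{
			{Name: "frontend", Action: api.IntentionActionAllow},
		},
	}

	if _, _, err := client.ConfigEntries().Set(intentions, nil); err != nil {
		log.Fatal(err)
	}
}
```

For HTTP services, the same entry can additionally carry L7 permissions that restrict which paths or methods the source may call, as described above.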

Summary

A service mesh is a low-latency infrastructure layer designed to manage high-volume communications between microservices. It adds observability, security, reliability, and traffic control features to applications without modifying service code. The service mesh consists of two main components: the data plane and the control plane. The data plane handles request routing, encryption, observability, and additional security features, while the control plane is responsible for dynamic configuration.

Key features of a service mesh include:

  1. Observability: Captures metrics, traces, and logs to provide insights into system behavior, aiding in effective debugging and diagnostics.
  2. Reliability: Implements health checks, retries, and timeouts to ensure high availability and resilience of services.
  3. Traffic Control: Manages traffic routing between services, supporting deployment strategies like rolling, blue/green, and canary deployments.
  4. Security: Secures network communications with mTLS, authenticates, and authorizes requests, implementing a zero-trust network model.

The service mesh works alongside the API Gateway: the API Gateway manages North-South traffic while the service mesh handles East-West traffic, ensuring efficient and secure communication across microservices.