104. The Evolution of Service Meshes
Hosted by Robert Blumen, with guest Luke Kysow.
As microservices and container orchestration have grown in popularity, reusable layers of logic, such as authentication and rate limiting, have been pulled out into separate entities known as a service mesh. Luke Kysow, a software engineer at HashiCorp, covers their history and evolution.
Luke Kysow is a software engineer at HashiCorp, and he's in conversation with host Robert Blumen. The subject of their discussion is the idea of a service mesh. As software architecture moved towards microservices, several reusable pieces of code needed to be configured for each application. On a macro scale, load balancers need to be configured to control where packets are flowing; on a micro level, things like authorization and rate limiting for data access need to be set up for each application. This is where the service mesh came into being. As microservices began to call out to each other, shared logic was taken out and placed into a separate layer. Now, every inbound and outbound connection--whether between services or from external clients--goes through the same service mesh layer.
Extracting common functionality out like this has several benefits. As containerization enables organizations to become more polyglot, service meshes provide the opportunity to write operational logic once, and reuse it everywhere, no matter the base application's language. Similarly, each application does not need to rely on its own bespoke dependency library for circuit breakers, rate limiting, authorization and so on. The service mesh provides a single place for the logic to be configured and applied everywhere. Service meshes can also be useful in metrics aggregation. If every packet of communication must traverse the service mesh layer, it becomes the de facto location to set up counters and gauges for actions that you're interested in, rather than having each application send out redundant data.
Luke notes that while it's important for engineers to understand the value of a service mesh, it's just as important to know when such a layer will work for your application. It depends on how big your organization is, and the challenges you're trying to solve, but it's not an absolutely essential piece for every stack. Even a hybrid approach, where some logic is shared and some is unique to each microservice, can be of some benefit, without necessarily extracting everything out.
Links from this episode
- HashiCorp helps automate infrastructure configuration
- Consul, Linkerd, Istio, and Kuma are several open source components for service meshes and control planes
- Mastering Service Mesh: Enhance, secure, and observe cloud-native applications with Istio, Linkerd, and Consul by Anjali Khatri and Vikram Khatri
- Consul service mesh
- What's a service mesh? And why do I need one? by William Morgan offers additional advice on when to use service meshes
- The Service Mesh: What Every Software Engineer Needs to Know about the World's Most Over-Hyped Technology
- What is a Service Mesh? - a blog from NGINX
Robert: Code[ish], this is Robert Blumen. I'm a DevOps engineer at Salesforce. I have with me Luke Kysow. Luke is a software engineer at HashiCorp, where he works on the Consul product. Luke, welcome to Code[ish].
Luke: Thanks for having me.
Robert: Luke and I are going to be talking about service mesh. I think it will be easiest to understand service mesh if we talk about the history of it. How did service mesh emerge? What came before and how did service mesh emerge from what came before?
Luke: When we start way before, we have the monolith, and then you have this monolith-to-microservices move. So before, when we had the monolith, maybe the early rumblings of service mesh were load balancers and firewalls, basically your routing infrastructure within whatever data center you were running in. And so I think the earliest concept of service mesh is probably around how do the packets route from your customers into your app and then back out to the customers? But it didn't really look like the idea of a service mesh back then.
Luke: That's where you started to see the first rumblings of the idea of a service mesh, where you have these services and they're calling each other. And now you need to do things like set up retries, for example, where if the service call fails, you want to do it again. So we started to see some solutions around that. Netflix had a couple of libraries out where you would embed a library in your application, and that would automatically do a retry if your request failed. It might do something complex where it does an asymmetric retry or something like that.
Luke: As we moved from monolith to microservices and your routing architecture got more complex, you started to see solutions that try to solve this complexity. I think really service mesh got its real start with Kubernetes with container scheduling. Again, this is just out of now you have the ability to have much more complex topologies, much more complex architectures and services and they're running all over the place now. So you needed to have some tools that would help you manage the routing between all those different services.
Luke: Also with Kubernetes and Docker containers in general, you had the ability to be a more polyglot organization where you could run many different services in many different languages. So going the Netflix route, where you had a library, where you actually include a library with all your applications and they would control the routing and the retries and all the other stuff, that was getting a lot harder. Because you're using different languages and now you need to have a library that works for all the different languages.
Luke: What we started to see was the idea of where we have this proxy, this sidecar, this process that's running outside of your application. It's not embedded in the same library, so it can be written in a different language. What it does is sit in front of your application, and all the traffic that's going into your application goes through that proxy, and all the traffic that's going out of your application goes through that proxy. This is the modern form, where service mesh has evolved to. What that gives you is that, now that you have these proxies running in front of all of your applications, you can program what those proxies do from a centralized control plane.
Luke: You can manage retries from a centralized control plane; you don't have to write a library that gets embedded into all these applications. You don't need to redeploy all the applications when you want to update the library. So that's the journey we've taken to service mesh, and there are instances of it existing in architecture throughout all the different stages. But I think that's where we're at now, which is when folks think of what a service mesh is, it's basically proxies running in front of all your applications, where you can program those proxies from a centralized control plane.
Robert: If I could summarize what I take away from that: when you have a microservice architecture, the need is service A has to talk to service B, and we're talking about RPC communication here. Then people realized over time there's a lot of commonality in how service A talks to service B, and how service B talks to service C, and so on. And naturally you want to abstract that commonality into some unitary abstraction, which started out being a library, but for the reason you mentioned, that not everybody is using the same language, it made sense to factor that out into a standalone process?
Robert: Then now service A talks to a sidecar proxy of A, which talks to a sidecar proxy of service B, which talks to service B.
Robert: What does the communication from A to the sidecar proxy of A look like?
Luke: When we think about it like in a container, this will be a localhost communication. The proxy is going to be running in the same networking stack. So that's a really, really quick hop over there, probably on the order of microseconds. Then the proxy will be listening on a defined port. The request will go into that proxy and then based on the rules that you've set up in your control plane, which we'll address later, that packet will get routed to service B. But if it's the second request, for example, the proxy is going to remember that the first one failed. And so it might add a delay and retry it in two seconds or something like that. So that's how the communication works between the proxy and service A.
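The retry-with-delay behavior Luke describes can be sketched in a few lines of Python. This is a minimal illustration, not how any particular proxy implements it: the function name `call_with_retries`, the exponential backoff schedule, and the rule that any 5xx status counts as retryable are all assumptions made for the example.

```python
import time
from typing import Callable


def call_with_retries(send: Callable[[], int],
                      max_attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical callable that performs one upstream request
    and returns the HTTP status code.
    """
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status < 500:
            return status  # success or a client error: do not retry
        if attempt < max_attempts:
            # Assumed backoff schedule: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return status  # every attempt failed; surface the last status
```

Because the delay function is passed in, the policy can be exercised without actually waiting, which is also roughly why it is attractive to keep this logic in one place rather than in every application.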
Robert: That sounds really great because if it's a localhost, I don't have to worry about DNS or packet loss or a ton of other things that could go wrong when I go over a real network.
Luke: Yes, absolutely. I mean, you're still going to have to worry about that on the real network between service B's proxy and service A's proxy, but the application doesn't really need to know about that. It can just be a very simple call. It doesn't even have to use SSL or anything, it can just be HTTP, plain text. And then the proxy could upgrade that to being a TLS call or whatever.
Luke: So that is the idea, is where we were starting to build a lot of complex logic into the application around firewall policies, around authorization, around retries and all these other things. What we were finding was that everyone was trying to write this logic over and over again. It was kind of unnecessary because the whole point of the application was to focus on the business logic. So the idea is like, okay well, let's just load all this stuff into the proxy once and control it from one place.
Robert: Pretty much every programming language is going to have an HTTP client, so if you're talking to the onboard sidecar with HTTP, then that is very simple to implement.
Robert: All right, now let's look at that next hop: you're going from this sidecar of microservice A to the sidecar of microservice B, and that's where a lot of the complexity is. And you've talked about, you've got possibly retries, load balancing, what are some other complexities that are handled in that layer?
Luke: You could also handle authorization. So is service A allowed to talk to service B? This is usually done via mutual TLS, where the request is encrypted with a certificate that identifies that service A's making the call. Then when it lands over on service B, it can cryptographically verify that this is indeed service A that is calling me. That's a really big one, is the idea of these zero trust networks, where even though you're within your own data center and you're making these calls between service A and service B, someone might be listening to your call, so we're going to make this request over SSL.
Luke: That's a big use case. You talked about load balancing and retries. Another big use case is a migration. So with the world of Kubernetes, what if I want to move this app to a completely different location? Rather than having to reconfigure all my applications to have that new URL or whatever, I can just make a single ... The applications don't need to change. I just make a change on my control plane, and we're actually pointing this traffic completely somewhere else.
Luke: Another use case is failover, maybe even between data centers. So you retry three times to this application in your local data center, but that's not working, that's failing. So now, what if we actually retry to our application running in a completely different data center, because we think that one might be up? There are more and more, myriad use cases. Traffic mirroring is where the idea is you send the same traffic and you split it. You send the same traffic to two different applications, so this could be something like a test application, or a new application running as a canary or whatever. And you're discarding all the responses, but you just want to get real production traffic in there.
Luke: Chaos engineering is another use case where you actually inject faults. So you make it so that that request fails for a certain percentage of the time so you can see how your application responds to failure.
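Fault injection of the kind described here boils down to failing a configurable percentage of requests before they reach the real handler. The following Python sketch shows that idea only; `inject_fault`, the 503 status, and the handler signature are hypothetical names and choices for illustration, not the API of any specific mesh.

```python
import random
from typing import Callable


def inject_fault(handler: Callable[[str], int],
                 fail_percent: float,
                 rng: random.Random) -> Callable[[str], int]:
    """Wrap handler so that roughly fail_percent of calls return 503."""
    def wrapped(request: str) -> int:
        # rng.random() is in [0, 1), so 0 never fails and 100 always fails.
        if rng.random() * 100 < fail_percent:
            return 503  # injected failure; the real handler is never called
        return handler(request)
    return wrapped


# Usage: fail about 10% of requests to see how callers cope.
flaky = inject_fault(lambda request: 200, 10, random.Random())
```

Passing the random generator in makes the failure rate reproducible in tests, which matters when the goal is to observe how an application responds to failure.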
Robert: I want to focus a little bit on the mTLS. That implies that this proxy has a capability of storing all the certificates it needs, both client and server certificates, and presenting the right certificate if it's a client. So is certificate management one of the capabilities of the sidecar?
Luke: Yes. Usually the sidecar is what we call the data plane. It doesn't usually have a lot of complex logic in it and basically it just gets configured by the control plane. The control plane will be the one that would be sending down the certificates to the proxy. They'd be living in memory on the proxy, but they'd be rotated by an API call or something like that. So the proxy, it does have them because it needs them to be able to make those requests, but I wouldn't say it's managing them. It's just reading them dumbly from the control plane.
Robert: I am going to come back to control plane, but I want to focus a bit more on the details of the proxy to proxy. Is HTTP/2 the usual choice?
Luke: Between the proxies?
Robert: Yes, between the proxies.
Robert: Why is that?
Luke: Well, I'm not sure if it's able to upgrade an HTTP/1.1 request to HTTP/2. But if it can speak over HTTP/2, that is a lot better because, I mean, it's just a much more efficient protocol for HTTP requests, so it can multiplex the requests across a single connection. You can send multiple requests at once. And you don't have this problem that HTTP/1.1 had around head-of-line blocking, where you couldn't process later requests because you were still waiting on the response to an earlier one. So I think you can pack a lot more bytes on the wire, a lot more efficiently, and so that's why it'd be preferred between the proxies.
Robert: Where you were going when you were talking about all the different things it does, one of the ones I have in my list that you maybe touched on or maybe not is a circuit breaker pattern. Can you describe what that does and how that helps?
Luke: Yeah, that's the idea where ... We see this all the time with distributed systems where, say your application is running a bit slow so maybe your requests to that application time out. Well, now we have our magic service mesh so all we need to do is just retry that request. Let's retry it immediately. What actually often happens is that now that application that was starting to be a little bit slow is suddenly getting hammered with three times as many requests, and so it's slowing down even more. Eventually what actually needs to happen for that application to recover is that all the requests to the application need to stop.
Luke: This is the idea of a circuit breaker where the proxy recognizes that, okay well, this application, it hasn't responded to any requests for the last couple of seconds. Let's just stop sending all requests to it, so the circuit is broken and we stop sending requests to it. Now the application that is calling, it has to deal now with the fact that the circuit's broken and so instead of maybe waiting for a timeout and having its request fail, it just fails immediately. You still have to deal with that failure from that one application but what it does is it gives potentially time for the other application to recover so it's not getting hit with so many requests. Also it'll show up on your monitoring, like, "Hey, service A has a circuit breaker open now and so it can't talk to service B." So that's a little bit easier to see on your monitoring as well.
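A circuit breaker along these lines can be sketched as a small state machine. The Python below is an illustrative sketch under stated assumptions: the class name, the consecutive-failure threshold, and the fixed cool-down before a trial request is allowed are all made up for the example, and real proxies expose different knobs.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop sending requests after repeated failures, then probe again later."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a request may be sent upstream right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        """Feed back the outcome of a request that was allowed through."""
        if success:
            self.failures = 0
            self.opened_at = None  # close the circuit again
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
```

When `allow()` returns False the caller fails immediately instead of waiting for a timeout, which is the fast-failure behavior described above, and the open state is a natural thing to export to monitoring.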
Robert: Another item on my list is a publication of metrics about the traffic. Say a bit more about that.
Luke: One thing in a big distributed system is it's really hard and complex to know what's happening. One of the ways we combat that is through metrics, so I have a bunch of graphs that show me that service A is calling service B at a certain requests per second, and a certain number of requests are failing. How do we get those metrics?
Luke: Before service mesh, what you would do is you would have a library or you custom build something where on every request, you'd have some code that says, "Increment a counter that I'm making this request, and record a gauge about how long the request took. And increment another counter if that request failed." Again, you were writing all this code in your application, so just as with the routing logic, we saw a lot of benefits there. It's like, okay, let's push all that out into the proxy and have the proxy be the one that's emitting all these metrics.
Luke: These proxies now contain all those metrics around a request from service A to service B, how many there were, how fast they were completed, the latency, any errors, et cetera, et cetera. So you don't need to write all those stats in your application again, the application can just be dumb, and it can just send those requests and not worry about recording those statistics. Now all of these proxies are emitting these metrics and they're emitting them all the same across all the proxies, even though the languages of the applications behind them, they might be different. So now you can aggregate all these statistics in your dashboards and see them across your infrastructure, how all the service to service calls are going, and identify any issues like that.
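The per-request bookkeeping that moves out of the application and into the proxy might look roughly like the following Python sketch. The class and method names are hypothetical, and a real proxy would export these values to a metrics system rather than keep them in dictionaries; the point is only that the same three measurements are recorded for every service pair, whatever language the application is written in.

```python
from collections import defaultdict


class ProxyMetrics:
    """Request count, error count, and total latency per (source, destination)."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latency_sum = defaultdict(float)

    def observe(self, source: str, destination: str,
                status: int, seconds: float) -> None:
        """Record one completed request as seen by the proxy."""
        key = (source, destination)
        self.requests[key] += 1
        self.latency_sum[key] += seconds
        if status >= 500:  # assumption: only 5xx responses count as errors
            self.errors[key] += 1

    def error_rate(self, source: str, destination: str) -> float:
        """Fraction of failed requests for one service pair."""
        key = (source, destination)
        total = self.requests[key]
        return self.errors[key] / total if total else 0.0
```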
Robert: It simplifies gathering metrics because some things are easier to count when you centralize the logic.
Luke: Yes, absolutely, and the way to retrieve them is the same. You can imagine, at my old job, StatsD and Graphite were the ways you would emit metrics, but Prometheus is now a lot more popular. Imagine, in the old days, you would have to upgrade those applications and literally change the library inside them to say, "Okay, now I want you to emit Prometheus metrics," but with the proxies, you don't have to do any of those code changes. You can just make changes to them at the control plane layer.
Robert: You did mention load balancing and you talked about an earlier approach where you would have a load balancer that is a component of your network, or maybe a proxy layer that's on the server side. There's a centralized point where requests go and then it distributes them out. There is a subtle point here, which is load balancing will now be on the client side in a service mesh, is that right?
Luke: Yeah, absolutely. Previously you could either have ... Every application has a load balancer that lives outside of it and anyone who wants to talk to that application, they send a request directly to the load balancer. And then that load balancer is now going to be load balancing between those applications. But now in that world, that load balancer is gone and so that application, its proxy is essentially that load balancer. So it's moved into the client, like you said, it's client side load balancing. Which saves you a hop, because you don't have to hit the load balancer and then the application behind it. You can also do a lot more complex load balancing at that level.
Luke: So in the old world where service A called service B through service B's load balancer, that load balancer lived in one place. Even though service B, its actual backing instances were living maybe in ... if we think about something like different availability zones ... they might be living in different availability zones. So now service A, when it calls service B, it can actually notice when it's doing its load balancing that, hey, I get a lot faster responses from this one instance of service B. I don't know why, but I'm just going to keep calling that one because I get a lot faster responses. In reality, what may be happening is they are both living in the same availability zone and so the network's a lot faster between the two of them. So you do get some benefits when you move that load balancing to the client side.
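The "keep calling the instance that answers fastest" idea can be illustrated with a small latency-aware picker. This Python sketch is deliberately simplistic and makes several assumptions: it tracks an exponentially weighted moving average of response times per instance (the `alpha` smoothing factor is an arbitrary choice), it tries unmeasured instances first, and it always picks the single fastest one, whereas production load balancers usually add randomness so that all clients do not pile onto the same instance.

```python
from typing import Dict, List, Optional


class LatencyAwareBalancer:
    """Prefer the instance with the lowest smoothed response time."""

    def __init__(self, instances: List[str], alpha: float = 0.3) -> None:
        self.alpha = alpha
        # None means the instance has not been measured yet.
        self.ewma: Dict[str, Optional[float]] = {addr: None for addr in instances}

    def pick(self) -> str:
        """Choose the next instance to call."""
        for addr, value in self.ewma.items():
            if value is None:
                return addr  # measure every instance at least once
        return min(self.ewma, key=self.ewma.get)

    def report(self, addr: str, seconds: float) -> None:
        """Fold one observed response time into the moving average."""
        prev = self.ewma[addr]
        if prev is None:
            self.ewma[addr] = seconds
        else:
            self.ewma[addr] = self.alpha * seconds + (1 - self.alpha) * prev
```

In the availability-zone example above, the instance in the caller's own zone would settle at a lower average and win most picks without the caller ever knowing why.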
Robert: I could see one advantage of the load balancer being closer to the server is the load balancer would know something about all the traffic coming into service B. Let's suppose we had services X, Y, and Z that are all calling service B and they don't necessarily have any ability to coordinate among each other, is there some way that client side load balancing can compensate or adapt so that it's getting served by the most available units on the backend?
Luke: Yes, absolutely. There's the concept of passive and active health checking. Active health checking is that service A is actually making calls specifically asking, "Hey, what is the health of service B in all these different instances that I could be calling?" And then directing its traffic to only the healthy ones or the ones that were responding quicker. Then another idea is the idea of passive health checking, where the requests were going to be going to service B anyway, so rather than making specific health check requests, let's just watch the responses for those requests and record them and notice how fast they're going and also if they're getting a lot of bad responses, 500s or something. And using that data to decide where I actually route to and to prefer routing to the healthiest instances or the ones that are responding the fastest.
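Passive health checking, as described, means drawing conclusions from responses to traffic that was being sent anyway. A minimal Python sketch of that follows, assuming a made-up rule that an instance is dropped from rotation after three consecutive 5xx responses. The names are hypothetical, and the sketch leaves out how an ejected instance would be re-admitted, which a real implementation would need.

```python
from typing import Dict, List


class PassiveHealthChecker:
    """Track consecutive server errors per instance from normal traffic."""

    def __init__(self, instances: List[str],
                 max_consecutive_errors: int = 3) -> None:
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors: Dict[str, int] = {addr: 0 for addr in instances}

    def observe(self, addr: str, status: int) -> None:
        """Record the status of a real response from one instance."""
        if status >= 500:
            self.consecutive_errors[addr] += 1
        else:
            self.consecutive_errors[addr] = 0  # any good response resets the streak

    def healthy(self) -> List[str]:
        """Instances that are still eligible to receive traffic."""
        return [addr for addr, count in self.consecutive_errors.items()
                if count < self.max_consecutive_errors]
```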
Robert: How would service A find out about all the available instances that are able to serve service B?
Luke: That's the same kind of question about the certificates. That's where we go back to the proxy is kind of this dumb layer that just gets configured by the control plane. So that logic of what services exist and what their IPs are, or where they are, that all happens at the control plane. So the control plane, if we think about the concept of something like Kubernetes where you have a scheduler, the control plane would be talking to that scheduler and it would be asking, "Hey, where are all the services? What are their IP addresses?"
Luke: If it changes, the scheduler would be notifying the control plane and saying, "Hey, okay, that IP has moved over here or that one doesn't exist anymore," and then the control plane is responsible for configuring the proxies. So pushing or pulling, depending on how it's built, the configuration down to the proxies. There's a little bit of delay there obviously, and then it would tell the proxy, "Okay, this service has moved over here. It's got a new IP address or it doesn't exist anymore."
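The flow from scheduler to control plane to proxies can be modeled in a few lines. The Python below is a toy, in-process sketch of the push variant only: `Proxy`, `ControlPlane`, and `on_scheduler_event` are hypothetical names, and real meshes do this over a network API rather than direct method calls.

```python
from typing import Dict, List


class Proxy:
    """Data plane: holds whatever routing table it was last given."""

    def __init__(self) -> None:
        self.routes: Dict[str, List[str]] = {}

    def configure(self, routes: Dict[str, List[str]]) -> None:
        self.routes = routes

    def resolve(self, service: str) -> List[str]:
        return self.routes.get(service, [])


class ControlPlane:
    """Tracks service endpoints and pushes them to every registered proxy."""

    def __init__(self) -> None:
        self.endpoints: Dict[str, List[str]] = {}
        self.proxies: List[Proxy] = []

    def register_proxy(self, proxy: Proxy) -> None:
        self.proxies.append(proxy)
        proxy.configure(dict(self.endpoints))  # send the current state on join

    def on_scheduler_event(self, service: str, addresses: List[str]) -> None:
        """Called when the scheduler reports that a service moved or went away."""
        if addresses:
            self.endpoints[service] = list(addresses)
        else:
            self.endpoints.pop(service, None)  # the service no longer exists
        for proxy in self.proxies:
            proxy.configure(dict(self.endpoints))
```

The proxy here has no discovery logic of its own, which matches the description of the data plane as a layer that simply gets configured.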
Robert: You mentioned control plane a few times. I've been promising to come back to that. Start off, let's define what it is and then we'll drill down more into that.
Luke: As a service mesh developer, how we think about things is this idea of a data plane and a control plane. In our example of service A to service B, our data plane is the request from service A to its proxy and then from that proxy to service B's proxy, and then to the actual instance of service B. So it's the data the applications are sending to each other through the proxies. Then the control plane is a little thing that sits above all of that, that's actually configuring those proxies and how they work. That's where the control plane is. The control plane is basically going to be configuring these proxies, telling them, "Use a certificate," or, "Hey, when they get a request for service B, route it to this IP address or this IP address." That's the concept of the control plane.
Robert: Is a control plane then a service or a set of containers that runs on the same network as the data plane?
Luke: Yeah. Typically if we think about something like a Kubernetes cluster, your data plane will still be your proxies that are running alongside your applications. The control plane will be a deployment, another application basically, that's running in there, that's talking to the proxies and is also talking to the scheduler. If we're looking at VMs or something, let's say you're using Consul, the proxies would be living on each of your VMs. Then the control plane would be running on another set of VMs as a new VM application that is talking to the proxies.
Robert: Roughly how much communication is there between the control plane and the data plane as compared to data plane communication?
Luke: In terms of actual bytes moved, I would imagine that the data plane is going to have way, way more communication between it. The communication from the control plane is what changed basically, so if you had an incredibly dynamic cluster where you had services being registered and de-registered very, very often, then the control plane is going to have to communicate to the proxy saying, "Hey, this application has moved," or whatever.
Luke: Maybe if it was incredibly dynamic, that communication could be really large, but it's not going to be communicating that much information. Do you know what I mean? It's going to be communicating the service name and the IP that it has. So in terms of bytes moved around, I would imagine that on almost all cases, the data plane is going to have much more traffic because that's real application traffic, services sending traffic to one another and that can be very large.
Robert: What is the communication between the control plane and the data plane? What does that look like?
Luke: I can speak to Consul service mesh. It's done using an RPC mechanism. Most service meshes will be using gRPC if they're using a proxy technology called Envoy. So that's what Consul uses, it uses gRPC to talk to Envoy. I think you've done another episode on gRPC, but for the listeners that haven't listened to it, it's an RPC communication framework coming out of Google that allows for very, very quick communication. And also it's strongly typed, so between different languages, you don't have any issues with JSON decoding failing or anything like that.
Robert: We've talked about how the data plane relies on the control plane for information, that it needs to function, like certificates and the list of IPs and so on. It seems to me there's got to be some kind of a careful bootstrap process where you're not kicking the can down one more level. How does the thing all get bootstrapped up to where the data plane knows where to find the control plane, or the other way around? Is it a pull or a push?
Luke: How things typically work is at some point, you need to know the location of the control plane. So when the proxy starts up, it needs to reach out to some address and be like, "Hey, where's my configuration coming from?" So in something like a scheduler like Kubernetes, we have this idea where we can actually inject something into the application before it runs. Say I'm scheduling service A, we can actually intercept that scheduling request and say, "Hey, I actually want you to add in a sidecar, add this in to this deployment," so when this application actually runs, it's going to have some extra code running alongside it.
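The injection step can be pictured as a function that takes the workload definition as submitted and returns a copy with a proxy container added, including the control plane's address so the proxy knows where to fetch its configuration. The Python sketch below works on a plain dictionary; the container name, image, and the `CONTROL_PLANE_ADDR` variable are hypothetical, and the structure only loosely mirrors a Kubernetes pod spec.

```python
import copy
from typing import Any, Dict


def inject_sidecar(pod_spec: Dict[str, Any],
                   control_plane_addr: str) -> Dict[str, Any]:
    """Return a copy of pod_spec with a proxy container appended."""
    patched = copy.deepcopy(pod_spec)  # never mutate the submitted spec
    patched.setdefault("containers", []).append({
        "name": "sidecar-proxy",                   # hypothetical name
        "image": "example.com/mesh/proxy:latest",  # hypothetical image
        "env": [
            # Tells the proxy where to find its configuration at startup.
            {"name": "CONTROL_PLANE_ADDR", "value": control_plane_addr},
        ],
    })
    return patched


# Usage: the application container is untouched; a second one is added.
original = {"containers": [{"name": "service-a", "image": "example.com/service-a:1.0"}]}
patched = inject_sidecar(original, "mesh-control-plane.internal:8502")
```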
Luke: At that point, the control plane would be adding that sidecar in and it would be part of this configuration that it added. It would be like, "Here's my address, because I know my address because I'm me," so that's where that bootstrapping would happen. But where it does get tricky is around identity and security. So can any old proxy just join and say, "Hey, service mesh, I'm a proxy for service A, oh, and give me my certificates so I can talk as service A"? How does the control plane know that that proxy is actually allowed to be a proxy for service A?
Luke: That's where there's a protocol, it's called SPIFFE, S-P-I-F-F-E. What happens is, if you're running on VMs, you can use the identity of the VM, maybe in the cloud or whatever, to attest like, "Hey, this is service A." And if you're running in Kubernetes, there's a concept of a service account, which is basically the identity of that service. So we would validate that when that proxy starts up, it would send its service account information and that would prove that it's service A. After that, the control plane would be okay with sending the certificates there.
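A SPIFFE identity is written as a URI of the form `spiffe://<trust domain>/<path>`. As a rough illustration of how a service account could map onto such an identity, the Python sketch below parses an ID whose path follows an `/ns/<namespace>/sa/<service-account>` layout. That path layout is an assumption for the example (some Kubernetes-based meshes use it, others use different paths), and parsing an ID is not the same as verifying it: in practice the ID is trusted only because it arrives inside a certificate that has been cryptographically validated.

```python
from typing import Dict
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id: str) -> Dict[str, str]:
    """Split a SPIFFE ID into trust domain, namespace, and service account."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = [s for s in parts.path.split("/") if s]
    # Assumed path layout: /ns/<namespace>/sa/<service-account>
    if len(segments) != 4 or segments[0] != "ns" or segments[2] != "sa":
        raise ValueError(f"unexpected SPIFFE path: {parts.path!r}")
    return {
        "trust_domain": parts.netloc,
        "namespace": segments[1],
        "service_account": segments[3],
    }


identity = parse_spiffe_id("spiffe://cluster.local/ns/default/sa/service-a")
```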
Robert: I'm not familiar with Kubernetes, but the way I've seen this problem solved in many systems is through configuration management or provisioning tools like Terraform that know everything, all the certificates and the secrets, and are able to inject that into different types of configurations before they start. Is that another approach or is that possibly an approach of how it gets into Kubernetes in the first place?
Luke: Yeah, no. And I apologize for speaking so much about Kubernetes. I think that's where we find folks trying the service meshes first. Mostly because it's a dynamic environment and so you can configure things a lot more easily and experiment a lot more easily than on something like VMs, where you are literally provisioning a VM, and it takes a little longer to deploy it. But yeah, when we look at how it works on virtual machines, you could have your Ansible or whatever, your configuration, Terraform, putting these certificates onto the VM. But what we find is that we actually want the certificates to be very dynamic so we want them to be rotated every day or something like that.
Luke: So what we tend to find is we'll provision the certificate that allows you to talk to the control plane and we'll provision the identity that proves that you are allowed to talk to the control plane, so be it a special token or something like that, that proves you're allowed to talk to a control plane. But from then onwards, once that proxy has that data and it can talk to the control plane, and it can attest that it is service A for example, then we actually have the certificates being rotated a lot more quickly. So the control plane will actually be sending those down to the proxy and they'll be living in its memory and they'll be easily rotated.
Robert: If I understood that, in order to bootstrap, you need a minimal amount of information that proves who you are, and that has to come from outside of yourself. It gets injected by something and the data plane can use that to authenticate itself to the control plane. Then it can get more information from the control plane that it needs to do the things it's allowed to do. And if service A said, "I want to talk to service B and C," and the control plane doesn't know that service A is allowed to do that, it could deny that request for those certificates.
Luke: Yeah. What would usually happen in that case where you have service A, it's not allowed to talk to service C for example, would be either when it tries to talk through its proxy, its proxy would refuse to let that request continue through to service C. Or what you would see is the proxy would allow it to reach service C but then service C would inspect that request and it would see that it's from service A. And service A isn't allowed to talk to service C, based on the rules that I'm getting pushed from the control plane, so we see how it all connects. Then at that point it would refuse the request with a 400 or something.
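The check that either proxy performs can be thought of as a lookup in a rule table pushed down from the control plane. The following Python sketch assumes a made-up rule format: a dictionary keyed by (source, destination) pairs, with `*` as a wildcard source, exact matches taking precedence over wildcards, and everything else denied by default. Real meshes each have their own rule syntax and precedence rules.

```python
from typing import Dict, Tuple

Rules = Dict[Tuple[str, str], bool]


def is_allowed(source: str, destination: str, rules: Rules,
               default: bool = False) -> bool:
    """Decide whether source may call destination under the given rules."""
    # The most specific rule wins: exact pair first, then wildcard source.
    for key in ((source, destination), ("*", destination)):
        if key in rules:
            return rules[key]
    return default  # assumed zero-trust default: deny unless a rule allows


# Hypothetical rule set as the control plane might push it to a proxy.
rules: Rules = {
    ("service-a", "service-b"): True,   # A may call B
    ("*", "service-c"): True,           # anyone may call C ...
    ("service-a", "service-c"): False,  # ... except A
}
```

A proxy that gets False back would refuse the request and, as discussed next, could emit a metric for the denial.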
Robert: Looks like you could do a certain amount of security monitoring-
Luke: Absolutely, yeah.
Robert: ... with this architecture of detecting entities that are trying to do something they shouldn't be able to do.
Luke: And this comes back to the metrics thing, where you can imagine in the old days where each application would have to emit a metric saying, "Request that wasn't authorized," or something like that. But here you could actually have all the proxies emit that same metric and you could track that across your infrastructure and notice maybe some incursion or something happening.
Robert: We've mentioned Kubernetes, what are some of the popular open source offerings for either the proxy or the control plane server?
Luke: I mean, I will make a clarification, too. Kubernetes is neither the control plane nor the data plane. It's just a platform upon which applications can be run and then often that's where a service mesh, with its data plane and its control plane, ends up being used. But in terms of the data plane, there's a very, very popular project that's used for most of the service meshes and it's called Envoy.
Luke: This is an open-source project out of Lyft, you know, like Uber, Lyft? That's what they were using internally for their own homegrown service mesh. That was open-sourced and it's a really, really quick dynamic proxy that many of the service meshes are using for their data plane. Specifically Consul uses that, it still uses that. Then in terms of control planes, there's Istio and Consul and Kuma. There's a number of service meshes, I'm sure I'm missing some, there's Nginx and there's also a service mesh called Linkerd. And then Linkerd, for their data plane they're not actually using Envoy, they're using a proxy that they wrote themselves in Rust. So they're using their own proxy and obviously they have their own control plane as well.
Robert: So all of these are roughly adhering to this architecture pattern. What are some significant differences between some of the different tools?
Luke: One big difference from Linkerd's side is that they wrote their own proxy in Rust so they're not using Envoy. That lets them expose some capabilities that Envoy doesn't have. Then I think some other differences you look at is, if we look at Consul, many of these service meshes are run exclusively in Kubernetes so you have to run them on the Kubernetes scheduler. What that comes back to is when we talked about, you asked a question how does the service mesh know where the services are? Many of these service meshes are tied to Kubernetes so they are very good at asking, "Hey, Kubernetes, where are the services?"
Luke: But Consul was actually built before Kubernetes even existed and so we have our own system for knowing where services are. Consul runs very well on VMs and other platforms where there's no Kubernetes to ask like, "Hey, where are all the services?" Instead, the services are registered with Consul itself so that's a big difference there. I think Nginx Service Mesh, I haven't used it personally, but I'm sure it's very well-suited to folks who were already using Nginx for their load balancing and ingress there. So each service mesh has their own niche and their own use case that they're focusing on.
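The difference between asking a scheduler where services are and having services register themselves can be sketched with a tiny in-memory registry. This is only an illustration of the registration idea and is not Consul's API: the `ServiceRegistry` class, its method names, and the addresses are made up for the example.

```python
from collections import defaultdict


class ServiceRegistry:
    """Hypothetical registry: services announce themselves instead of being discovered."""

    def __init__(self):
        self._instances = defaultdict(set)  # service name -> set of addresses

    def register(self, name, address):
        """Called by (or on behalf of) a service instance when it starts."""
        self._instances[name].add(address)

    def deregister(self, name, address):
        """Called when an instance shuts down or is found to be unhealthy."""
        self._instances[name].discard(address)

    def lookup(self, name):
        """Return the known addresses for a service, in a stable order."""
        return sorted(self._instances.get(name, set()))


registry = ServiceRegistry()
registry.register("web", "10.0.0.1:8080")  # e.g. an instance on a VM
registry.register("web", "10.0.0.2:8080")
registry.deregister("web", "10.0.0.1:8080")
web_addresses = registry.lookup("web")
```

Because the registry does not depend on any scheduler, the same lookup works whether the instance runs on a VM, in a container, or anywhere else that can reach it.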
Robert: Are all of these projects independent views on how the things should work or is there a lineage of generation one, and people learned from that? And then you have the second generation of tools, and each generation learned from the failings of the previous?
Luke: Yeah, I would say it first started out with this library that lived in each application. The Netflix one, I remember the name now, was Hystrix. So that's probably the first musings of this concept. And then Twitter, they had their own library and I think it's called Finagle, that was written in Java. So there's some Twitter engineers that left Twitter because they were thinking, okay, this doesn't make sense as a library. It makes sense more as a standalone data plane and control plane, and so that's Linkerd. Those are the original Linkerd folks, and they built this proxy called Linkerd but it was written off of Finagle so it was written in Java. It was really, really cool actually and it was the first time, and if you look for service mesh back in the very beginning, they were the first people who coined the term service mesh.
Luke: We actually tried out their proxy, but one of the problems with their architecture was that this was the exact same time that people were moving to Kubernetes. What you didn't want to do was run this massive Java proxy that took up minimum 256 megabytes of RAM, which isn't a lot, but it's a lot when you're running it next to your tiny little Python app that takes 20 megabytes of RAM, or whatever. So we went from the libraries, to this first idea of this proxy that lives outside of your application, but that one wasn't lightweight enough. It was too heavy.
Luke: Then what we saw as the evolution was suddenly this proxy called Envoy got released out of Lyft and a lot of service meshes were like, "Okay, cool. That's a really, really lightweight proxy that we can use because it's written in C++, it takes very, very small amounts of memory to run. We can run this next to our application." That's where Istio came out as the first service mesh there, where it was the first one using Envoy as its data plane and it was very specialized to run in Kubernetes.
Luke: Then what happened after that was that Linkerd, they rewrote their whole service mesh and they wrote a brand new proxy that wasn't based on Java, it was based on Rust. And then they rewrote their control plane to work with that proxy and they made it so it worked really well with Kubernetes. Then I think following that, Consul jumped into the mix where Consul had a lot of people that were running services everywhere. Consul knows about where all the services are and everything, and so our customers were asking us, "Hey, we're already running Consul that knows where all our services are and we're already building homegrown service meshes on top of Consul because that's where the registration information is. It knows where all the services are. Can you provide a service mesh yourselves?" So that's how Consul jumped into the mix there. Then I think there's been an explosion from then on of lots of folks looking into service mesh.
Robert: Luke, we pretty much covered all my prepared material. Is there anything you'd like the listeners to know about this topic of service mesh that we haven't covered?
Luke: Yeah, I think taking off my service mesh engineer hat and putting on my operations engineer hat, it's really good for folks to know about the service mesh and you hear it talked about a lot. What I think is important to do is understand what it is and the concept and when it makes sense for your application. But also I would caution folks not to think that they need a service mesh right off the bat. I think depending on how big your organization is and the challenges you're trying to solve, it does make sense, but I wouldn't say it's something that you absolutely need to have in your stack. I think that there's a lot of hype behind it that makes folks think that maybe that is something that they absolutely have to have.
Luke: If you're running your application and you're doing everything fine and everything's working really well for you, you don't have that many problems with routing or metrics or you don't have a really, really dynamic environment, then I don't think there is a requirement to run a service mesh. But I think that the signals that you should look out for, the flags that you should look out for when you're thinking, "Well, maybe it does make sense to bring this in," is when you're doing a big migration somewhere. When you find that your applications are a lot more dynamic, because you have more developers and they're building more applications, different applications.
Luke: You're finding you have a lot of complexity in routing between these applications. You find that your operations team is having to talk to developers and say, "Hey, can you make this change to your application, to route to this different location?" And also if you find that you're bringing in certificate management and you're having to manage all these certificates across all these different applications. I think at that point, then it behooves you to look into service mesh.
Robert: I'd like to explore that a bit more. If you're starting with a very small and simple project, you probably have a monolith, you don't have microservices. The only case where you need a service mesh is where you've already identified that you have a need for microservices. So you're already at a certain level of complexity, microservices being an inherently more complex architecture than the monolith.
Robert: If I said I'm going to adopt microservices, but I'm not going to bring on a service mesh until I need it, then you're looking at either you end up building a whole bunch of stuff into your applications or libraries and having to rip it out, or having, say, unencrypted communication between your microservices for what used to be private to your monolith. Is there really a migration path to microservices that does not involve a service mesh, and then you add it later?
Luke: Well, I think there can be. For example, say you were running VMs for example, and you were adding microservices and you were running those in VMs, and communication between them wasn't encrypted. Now you're looking at moving to maybe a scheduler or something like Kubernetes. I think you can make the argument that the first thing you should do is get those applications running in Kubernetes without a service mesh. Because you're also going to be dealing with so much complexity involved in that migration in and of itself, with new deploy pipelines, new ways to operationalize those applications. You have to containerize them and everything. There's so much complexity going on there that what I would say is if you already weren't encrypted between those applications, that moving them into Kubernetes without that encryption, you're not loosening your security and then it might not make sense to bring in a service mesh right away. You should get that operationalized working well first, before you bring in a service mesh.
Luke: Now, if you are in the instance where you already had encryption between the services and you're looking at building 100 more of them over the next year, then absolutely. If there doesn't seem to be a real path between the two of them, you do have to bring in a service mesh to solve that problem. You can do it by yourself with statically provisioned certificates, but you're right, at that point, it doesn't seem to make as much sense.
Robert: To wrap things up, Luke, where can listeners find you on the internet?
Luke: They can find me on Twitter @L-K-Y-S-O-W, @lkysow and also on GitHub by the same handle. And then I'm working on Consul so they can see us over at consul.io where the service mesh that I work on is situated.
Robert: Thank you so much for speaking to Code[ish].
Luke: Thank you very much. It was great to be here.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
Lead DevOps Engineer, Salesforce
Robert Blumen is a dev ops engineer at Salesforce and podcast host for Code[ish] and for Software Engineering Radio.