Kelsey Evans: All right, hello everyone and welcome to day four, the final day of the microservices practitioner virtual summit. I’m here with Matt Klein who is going to talk about Deploying Envoy at Lyft. Matt is joining us again. He spoke at our in-person summit in San Francisco this past January about Envoy, so we’re excited to have him back. Just to walk through your zoom features one more time. There’s a Q and A button as well as a chat button on the bottom of your zoom screen that you can use to ask questions throughout the presentation. Feel free to type those as Matt is talking and we’ll get to them at the end. We’ll save about 10 minutes or so for Q and A. So with that, Matt if you want to get started, we’re happy to hear it.
Matt Klein: Cool. Thanks everyone. This is my first web presentation, so this is pretty exciting. Today, let me see if I can get this software working, yeah, okay. Today we’re actually going to be talking about how we actually deploy Envoy at Lyft. I’ve done a lot of presentations over the last six to nine to twelve months about what Envoy is and kind of what the whole service mesh technology space is and I’ve gotten a lot of questions about the actual mechanics of how we got Envoy working at Lyft. This presentation is going to be a little more detailed about what we actually did here, which I think folks will find pretty interesting.
This presentation assumes that most people are already somewhat familiar with both Envoy as well as the entire service mesh paradigm but I’m going to spend just a couple of minutes just to give a quick Envoy refresher. If you haven’t heard of Envoy before and if you haven’t heard of service mesh, it’s an idea where you have an out of process proxy that accepts all of the application traffic. It basically makes the network transparent to applications. In this diagram that you see here what we’re showing is we have a couple of different service clusters, we have a couple of different services.
Alongside every service is a sidecar proxy or an Envoy proxy and the services aren’t actually aware of the topology of the underlying network. The services only talk to their Envoy proxy on localhost and they only receive responses from that local proxy on localhost. From the service’s perspective, the network doesn’t really exist. The process that’s sitting beside it actually does all the load balancing, rate limiting, retries, circuit breaking, all of those kinds of things and obviously all of the Envoys have to find all the other Envoys, so they’re working through some type of discovery system and that’s kind of the most basic view of the service mesh concept.
Just kind of a very, very quick refresher on Envoy features. Like I was saying before, Envoy is an out of process architecture. The idea is that instead of actually having to write all these complicated networking things in every application language or framework, we’re going to do it all in one place, and we’re going to put that next to the application. Envoy is written in C++11, so the performance is quite good. It’s also very productive. Envoy at its core is an L3/L4 proxy. What that means is that at its base level it’s a byte proxy, so obviously lots of proxies are used for REST kind of HTTP processing but there’s a lot of other things that we can proxy whether that be MongoDB or Redis or stunnel or just basic TCP rate limiting. There’s all types of different things that people want to do.
At its core Envoy is capable of just passing bytes back and forth. Obviously like I was saying before, a large portion of the modern internet is actually based on REST, so Envoy has a huge amount of processing in that area, so we also allow filtering at that level. Envoy is an H2-first proxy. That means that it can proxy in both directions from H1 to H2 and from H2 to H1. That means that it works very well with gRPC. Envoy does a lot of service discovery as well as active and passive health checking. Active health checking is when Envoy will send out-of-band pings to hosts to figure out if they’re up.
Passive health checking is when it will do inline health checking, so basically saying, if a host returns, say, three 500s in a row, Envoy will basically boot that host. We do a lot of advanced load balancing. These are things like retries, timeouts, circuit breaking, rate limiting, shadowing, outlier detection. You’ll hear me talk about it a lot, but it really comes down to observability. That’s going to be stats, logging as well as tracing and Envoy is also usable as an edge proxy. That’s going to be edge routing as well as TLS.
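[Editor's illustration] The passive health checking rule described above (boot a host after three 500s in a row) can be sketched roughly as follows. This is a minimal sketch, not Envoy's actual outlier detection code; the class name, method names and host addresses are hypothetical.

```python
class PassiveHealthChecker:
    """Eject a host after N consecutive 5xx responses (hypothetical sketch)."""

    def __init__(self, consecutive_5xx_limit=3):
        self.limit = consecutive_5xx_limit
        self.failures = {}    # host -> current run of consecutive 5xx responses
        self.ejected = set()  # hosts removed from load balancing

    def record(self, host, status_code):
        if 500 <= status_code <= 599:
            self.failures[host] = self.failures.get(host, 0) + 1
            if self.failures[host] >= self.limit:
                self.ejected.add(host)
        else:
            # Any non-5xx response resets the run for that host.
            self.failures[host] = 0

    def healthy_hosts(self, hosts):
        return [h for h in hosts if h not in self.ejected]


checker = PassiveHealthChecker()
for status in (500, 503, 500):
    checker.record("10.0.0.1", status)
checker.record("10.0.0.2", 200)
print(checker.healthy_hosts(["10.0.0.1", "10.0.0.2"]))
```

A real implementation would also let ejected hosts back in after some interval; that is left out here to keep the sketch short.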
This is Lyft from four years ago, now probably almost five years ago, very simple diagram. We have obviously clients, Internet, Amazon load balancer, PHP Apache monolith as well as our database MongoDB. This is the no-microservice architecture. It’s supposedly very simple but at this time we’re already seeing a bunch of problems. We already don’t have consistent observability, we’re already actually having a small distributed system in terms of how these components actually talk to each other and we’re already having operational problems trying to figure out what’s actually going on.
Now, fast forward to today. I’ve obviously skipped a lot between those four years or four or five years. This is what Lyft looks like today. Today we have clients coming into the Internet, we still use Amazon load balancers where we host our front or edge Envoys. We’re still decomposing our legacy monolith but we’ve run Envoy on it. We have a lot of Go services where we run Envoy. We have a lot of Python services where we run Envoy and then we have MongoDB, DynamoDB, we have obviously stats and we have a bunch of different systems that are being talked to. In this picture here we’ve gone again from a very simple monolithic architecture to now kind of a very large micro service architecture but what you’ll notice from this diagram is that Envoy is being used on every node, so that all communication to all components in the system go through Envoy.
That’s our quick refresher. What most people ask at this point is they say, “Wow, that previous diagram is super amazing but how, how did we actually get there?” To those that are running existing systems and aren’t starting with a new system from scratch, it’s probably fairly obvious that you can’t go from that four or five year ago picture to what I just showed. It doesn’t happen overnight. Has to be incremental, right, has to show value at each step. If we go for the perfect solution, we’re obviously never going to get there, so there’s a bunch of things I’m about to show you that if I looked at it today it’s not the most perfect solution, but it’s how we got to where we are. What I’m about to do is I’m actually going to go through kind of the history roughly at a super high level of how we built Envoy and how we actually deployed it. Like I’m showing here in this little picture it definitely was not easy.
All right, so what we did from the get go is we started with the edge proxy and most services and most microservice systems once they become of sufficient scale, they’re going to need some type of L7 edge proxy. This is because you obviously are taking traffic in via some client typically over some type of DNS connection and then you have to actually parse that traffic, you have to terminate TLS and you have to send that back to various backend systems. I’m showing a very basic photo here but you can see the traffic comes in to an AWS TCP ELB, we go into an edge Envoy reverse proxy system and based on foo, bar or baz we’re sending that traffic to either the foo service, the bar service or the baz service.
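[Editor's illustration] The foo, bar, baz routing at the edge can be pictured as a simple prefix match. This is only a sketch of the idea; the route table and function below are hypothetical and are not Envoy's configuration format.

```python
# Ordered route table: the first matching prefix wins.
ROUTES = [
    ("/foo", "foo_service"),
    ("/bar", "bar_service"),
    ("/baz", "baz_service"),
]


def route(path, routes=ROUTES):
    """Return the upstream cluster for a request path, or None if nothing matches."""
    for prefix, cluster in routes:
        if path.startswith(prefix):
            return cluster
    return None


print(route("/foo/rides"))
print(route("/unknown"))
```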
Obviously there’s a lot of existing edge proxies right now. There’s proprietary ones, there’s obviously Amazon’s ELB, Google’s got their own load balancer, Azure’s got their own load balancer. Even to this day kind of the feature set that exists in these edge load balancers particularly from an observability perspective is actually not very good and if you look back two and a half or three years ago when we started developing Envoy it was even worse back then. I think when we started developing Envoy about two and a half years ago, Amazon’s load balancers still did not output latency stats with percentiles, so it was impossible to get for example P99 latency stats, which is actually fairly unbelievable and that’s just the tip of the iceberg. You obviously want stats that are on a per service basis, you want just copious stats, you want tracing, you want custom log formats.
It’s probably very hard for you to see in your screen. I’ll be actually posting these slides but in light gray behind that diagram you can see just a small sampling of the stats that Envoy outputs for each upstream cluster. Using this data, even just from an edge proxy perspective, we were able to immediately get vastly better observability than what the existing load balancer solutions were actually providing. In this next slide, which hopefully you can see. This is a screenshot from our front Envoy edge dashboard and you can see that there’s a bunch of basic things here on a per host basis. Obviously we’ve got connections per second and we’ve got requests per second and we’ve got total connections and total request.
These are things that you would typically find in your basic kind of edge load balancer dashboards, but then on the next line we see downstream connection lengths at both P50 as well as P99. We have requests per second to each upstream cluster or service. We have failures on a per service basis, we have P99 latency on a per cluster basis and this is obviously just a fraction of the observability that we output from our front Envoy edge fleet. When we started with Envoy and we rolled this out kind of just at the edge layer, there was already huge benefit in terms of debugging production incidents. That was essentially where we started. We developed Envoy and we deployed it at the edge.
Moving forward what happened next is obviously going back to that beginning monolithic architecture, we had our edge load balancer, we have our kind of Apache PHP monolith and we have our backend database MongoDB. Kind of moving forward in time we’ve now got Envoy in front of the monolith. This is about two and a half years ago, two years ago. For those of you that are not super familiar with Mongo, Mongo is not so great at connection handling. We have a very large sharded Mongo installation at Lyft and as part of normal Mongo operations, it’s important to limit the number of connections that go into Mongo. That’s because still to this day Mongo is not a fully async architecture. It actually uses a thread per connection. What that means is that if you instantiate too many connections into Mongo, things can go haywire very, very quickly.
We were originally connecting to Mongo directly from PHP and as Lyft grew and we were scaling our monolith wider and wider, we actually needed to limit the number of connections that were going into Mongo. We decided to run Envoy just as a TCP proxy on each monolithic server. When we did that, we were able to collapse all of the connections that were coming into that local Envoy and we were able to limit the number of connections that were going into the sharded Mongo installation.
From day one that was just a very simple TCP proxy scenario but once we did that a bunch of other things became clear. It became clear that we can start doing a lot more with that bump in the wire. We can actually parse BSON, so we can actually parse the L7 Mongo language and we can start spitting out incredible stats. We can spit out stats on percentile timings, on the different types of ops that are coming in and out. We can actually, via modifying the things that are sent down to Envoy, by actually putting call sites into the actual queries, parse those out at the Envoy layer and we can actually spit out per call site stats.
We then essentially used this ability that we’re already proxying this traffic to actually start doing L7 proxying of Mongo and spitting out incredible stats. Stats that we didn’t have before, that weren’t even available from the backend Mongo instance. Obviously we could have done this in PHP but as you’ll hear from me in terms of why Envoy is so powerful, you only do things in one place. Once we wrote this code in Envoy that allowed us to actually parse out stats, well, this is great. Now we can use this in Python and now we can also use this in Go. That’s incredibly powerful. It basically meant that we could write this code once and get this incredible stat output from one place.
Then beyond that once we do stats we actually realized, well now we can actually start doing other problems, right, or we can start dealing with other problems. Another problem with Mongo is that like I was saying before with this kind of poor connection handling, Mongo has a tendency to tip over when there’s too many connections or there’s too many connections per second. What we can do now is we can actually globally rate limit connections into Mongo by putting a rate limit filter into Envoy. Now we’re doing stats, we’re doing rate limiting and we’re doing it in all app languages. Just from putting that proxy there we’re actually getting this huge operational benefit by understanding what’s going on and actually preventing certain pretty bad failure scenarios. Mongo was next.
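[Editor's illustration] One common way to limit connections per second is a token bucket. The sketch below is a local, in-process stand-in chosen for illustration; the talk describes a global rate limit filter in Envoy, and this is not that filter's implementation. The class name and parameters are hypothetical, and the clock is passed in so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow up to `burst` connections at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill based on elapsed time, capped at the burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, burst=2)
# Three connection attempts at t=0, then one more half a second later.
results = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 0.5)]
print(results)
```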
Once we actually had Envoy running on our monolith we were already basically running it side by side with our services. We got it running alongside our PHP app, by this time we’re running it alongside Python, we started to bring up Go services. We can start using it for general sidecar systems, we can use it for ingress buffering, we can use it for circuit breaking and most importantly we can use it for stats, logging as well as tracing. One thing to note here is that this is still a very incremental process. We still got our AWS TCP ELB coming into the edge Envoy, we still got the edge Envoy going to some internal load balancer which is fronting all of our services, the traffic then goes to the mesh sidecar Envoy and the mesh sidecar Envoy still goes back to the actual service.
This is still kind of using all these bumps in the wire with these internal load balancers but even with these bumps in the wire we’re still getting a ton of benefit and again that benefit is mainly around observability so now we can start doing consistent logging. Envoy can actually generate IDs and propagate those IDs so that we can do log sampling across different services. We can start building in tracing and obviously at each hop we can start getting these copious stats that Envoy outputs and they’re consistent no matter what the application is, whether the application is PHP, Python or Go.
On the kind of circuit breaking, buffering side of things, because Envoy is written in C++ and has very high performance, it tends to perform better with very bursty workloads than particularly PHP and Python but even Go in certain cases. By putting Envoy in front of the applications we can actually do local circuit breaking to prevent the applications from falling over, and apply back pressure. Already here even with this very basic set up we are already seeing a lot of operational benefits. Once we do the kind of sidecar system now we’re left with this topology where we almost had this mesh but we’re still using internal load balancers to actually do service discovery.
Like I was saying before, internal ELBs have not been very good for Lyft historically. They just have not been very reliable. Like I was saying kind of the observability output from them has been very poor, so stats are very limited, logs get put into S3 and are actually very hard to process, no actual tracing and cloud providers have obviously started to actually rectify this over the last couple of years but the situation is still not great. There’s this realization at this point where we have this almost full mesh system but why are we still using these internal load balancers? They’re not actually adding very much.
The next step in our process is to actually get rid of the internal load balancers and go full mesh. In order to do this we have to actually do our own service discovery because up to this point we’ve been utilizing load balancers as our service discovery system. Previously hosts were coming up, they would register into the Amazon ELBs and then the Amazon ELB would actually do that service discovery and do that load balancing. If we’re going to get rid of these internal ELBs we actually have to do our own service discovery.
I’ll talk briefly about service discovery. I’ve talked about this a lot before but historically the way that most companies have dealt with service discovery when they do it themselves has been to use fully consistent systems. That’s to use a system like ZooKeeper, etcd or Consul. These things do work but as the size of the deployment grows, they tend to become very hard to manage because they’re essentially using a full leader election protocol, they’re having to do a bunch of Paxos-like algorithms to determine what the current state of the world is and as the number of hosts goes up, the chance of those systems falling over tends to increase and what you’ll find is that most large companies that rely on ZooKeeper or etcd usually have full teams that actually work on keeping ZooKeeper or etcd up.
When we were originally designing this we did not want to be in that situation. I’ve just had too much trouble in my career dealing with zookeeper. What we realized is that service discovery is not fully consistent. Service discovery from a microservices perspective is eventually consistent and it was very important for us to break that paradigm and say that all of the hosts don’t have to have a consistent view, they just have to converge.
What we ended up doing is we’ve built a dead simple system with a dead simple API. We built our own discovery service, is backed by dynamo, it could equally be backed by redis, eventually consistent and on every host where we ran Envoy we actually have a cron job to this day that once a minute it checks into our service. There’s a TTL on it, so if a host hasn’t checked in within five or ten minutes it gets swept and then all of the mesh Envoys they talk to the service and they basically fetch all the host info for the upstream clusters they actually have to talk to.
There’s a bunch of caching along the way. There’s caching in Dynamo, we usually have actually access in redis, there’s caching in the actual service, there’s caching in Envoy. The data is fully eventually consistent. It may be five to ten minutes before things actually converge in the worst case if a host dies, and the best case for an orderly shutdown is usually about one minute or so, but that’s okay because of the way that we layer on our active health checking, so those pings that go out every 15 or 30 seconds, as well as our passive health checking. If we’re attempting to talk to a host and it actually fails, let’s say that it has three failures within some period of time, we boot the host.
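[Editor's illustration] The check-in plus TTL sweep behaviour of the discovery service can be sketched as below. This is an in-memory stand-in under stated assumptions: the real service is described as backed by Dynamo with several caching layers, none of which is modelled here, and the class and method names are hypothetical. The five minute TTL mirrors the "five or ten minutes" mentioned in the talk.

```python
class DiscoveryRegistry:
    """Eventually consistent host registry: hosts check in, stale entries get swept."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.last_seen = {}  # (service, host) -> time of last check-in

    def check_in(self, service, host, now):
        # Called by each host roughly once a minute (the cron job in the talk).
        self.last_seen[(service, host)] = now

    def sweep(self, now):
        # Drop every entry whose last check-in is older than the TTL.
        expired = [key for key, seen in self.last_seen.items() if now - seen > self.ttl]
        for key in expired:
            del self.last_seen[key]

    def hosts(self, service, now):
        self.sweep(now)
        return sorted(h for (s, h) in self.last_seen if s == service)


registry = DiscoveryRegistry(ttl_seconds=300)
registry.check_in("foo", "10.0.0.1", now=0)
registry.check_in("foo", "10.0.0.2", now=0)
registry.check_in("foo", "10.0.0.1", now=240)  # only this host keeps checking in
print(registry.hosts("foo", now=60))
print(registry.hosts("foo", now=400))
```

Readers never need a globally agreed view here; each caller just sees whatever has checked in recently, which is the convergence property the talk relies on.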
This system has turned out to be incredibly reliable. Actually, in two and a half years, knock on my computer, I cannot think of a single incident that we’ve had based on the system, which is very uncommon. Like most companies that use these fully consistent systems end up having a variety of outages. This has been very good to us. Once the system was in place we’re now able to drop the internal load balancers and we’ve now formed this full mesh. All of the Envoys are actually talking to each other, which is pretty fantastic because now we have stats at every hop, we are no longer reliant on kind of black box technology from Amazon that we can’t see into, where if there’s a problem we don’t know if the problem is in the service, is it in the network, is it in the load balancer. This system that we’re showing here just from an operational perspective, from a management perspective has made things so vastly simpler.
All right. When people give talks about the service mesh, and this includes me too, I’m obviously guilty of this, we like to talk about the magic of the service mesh and how you deploy this thing that I just showed here and you’re going to get this amazing stats, logging and tracing and all these features. As it turns out there typically is application work that is needed to get these things to actually work. The primary thing that’s actually needed from kind of the thin client perspective or from an application perspective is we need to do propagation. We have these IDs that we pass around and these IDs are used for tracing and they’re also used for logging, so we get consistent logging.
Even with a fully transparent system where you use something like iptables to kind of redirect traffic and make it totally invisible to the app, the app still has to propagate these headers. Now, at Lyft from a simplicity perspective since we knew that we needed to propagate IDs and we didn’t want to actually do a complex iptables set up and at the time that we were deploying Envoy we had probably over a hundred services, we decided that we were basically going to force everyone to use a thin library. The thin library would actually include a client, I’m showing Python here, would include a client called Envoy client. It would be very user friendly and so the user or the client would type in the service that they actually want to talk to, they would send their messaging parameters and then they would fire away and get this magical response.
Internally that client knows what port to talk to the local Envoy on, so again we did not use iptables. We actually have Envoy listening on a port, 911 in our case, sorry, 9001 in our case, which is our egress port and then we forced all of our services to use this thin client. Because we’re forcing them to use this thin client, we get a bunch of other benefits. We get again the ability to do request ID tracing propagation but we can also guide devs in terms of good practices so that we can encourage them or force them to use certain things around timeouts, retries, various other policies. Instead of relying on devs to still use some type of application language or application library to send requests to some local Envoy, we actually guide devs by using this client into saying, “You know, we’re going to set a default timeout for you. We’re going to set a default retry policy for you.”
We have these thin clients that are written at Lyft for Go and for PHP. They kind of do this very basic functionality. People sometimes ask, “Well, okay but you still have to write this client, right, for, for every language” but it’s important to realize that Envoy at this point is probably over a hundred thousand lines of code and that’s not including all the libraries that it actually depends on. These libraries are on the order of hundreds of lines of code, so it’s just vastly simpler to maintain these small very thin libraries and still put all of that functionality within Envoy itself.
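[Editor's illustration] A thin client along the lines described might look roughly like this. It is a sketch, not Lyft's actual library, which is not shown in the talk. The class shape, the use of the Host header to name the target service, and the one second default timeout are assumptions; port 9001 comes from the talk, and `x-request-id` and `x-envoy-upstream-rq-timeout-ms` are headers Envoy documents.

```python
import urllib.request
import uuid

EGRESS_PORT = 9001  # local Envoy egress port mentioned in the talk


class EnvoyClient:
    """Hypothetical thin client: every call goes to the local Envoy on localhost."""

    def __init__(self, service, timeout_ms=1000, incoming_headers=None):
        self.service = service
        self.timeout_ms = timeout_ms           # default timeout chosen for the dev
        self.incoming = incoming_headers or {}  # headers of the request being served

    def build_request(self, path, method="GET"):
        # Propagate the caller's request ID if there is one, otherwise start a new one.
        request_id = self.incoming.get("x-request-id") or str(uuid.uuid4())
        headers = {
            "Host": self.service,  # assumption: Envoy routes on the Host header
            "x-request-id": request_id,
            "x-envoy-upstream-rq-timeout-ms": str(self.timeout_ms),
        }
        url = "http://127.0.0.1:%d%s" % (EGRESS_PORT, path)
        return urllib.request.Request(url, headers=headers, method=method)

    def get(self, path):
        # Actually sends the request; needs an Envoy listening on the egress port.
        request = self.build_request(path)
        with urllib.request.urlopen(request, timeout=self.timeout_ms / 1000.0) as resp:
            return resp.read()


client = EnvoyClient("users", incoming_headers={"x-request-id": "abc"})
request = client.build_request("/v1/users/1")
print(request.full_url, request.header_items())
```

The point of the design is that the application only names a service and a path; the port, ID propagation and default policies live in one small library per language.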
All right. At this point we’re about a year in. We’ve done a tremendous amount of work. We’re seeing real benefit from Envoy and now we need to run it everywhere, right? I mean, at this point we are running it on our monolith, we’re running it on our top biggest services, we’re probably covering 90% or 95% of our traffic at this point, so we’re fairly sure that it works but in order to get the real benefit, in order to actually get the real mesh we need it to run absolutely everywhere. This begins the process that I think many folks have been through, it’s not very much fun and we start this many month burndown.
This literally, no joke, involved me with a giant spreadsheet. We had an automated tool that would go and look at every service every day and basically figure out whether it was using Envoy or not and we would look at the spreadsheet every day and basically over a period of many months, we would work with other teams and help them convert. We had a bunch of instructions and it was obviously very easy but it still relied on teams doing that work. It was also a slog partly because of how we do our deployments. Obviously if Lyft were using two years ago kind of what some newer companies have today with a fully abstracted container-based deployment system where we could have just redeployed and like magically gotten Envoy, that would have been fantastic but that’s just not the reality of what Lyft has.
I mean, Lyft still has a fairly old kind of Salt-based deployment system that runs on raw virtual machines, so it was not so easy for us to actually go and get Envoy running everywhere. We went through this process that lasted a couple of months where we encouraged people to upgrade with a carrot-and-stick kind of approach, but at this point it wasn’t actually that hard because developers frankly had seen the value, they were seeing all of the stats, they were seeing the operational situation that having this technology actually gave them. I was surprised actually that we did not have to use much of a stick. It was mostly a carrot based approach.
Once people see this magic of the service mesh, it’s like a drug I think. I mean, it’s just, it’s a very powerful paradigm. Once people see it they don’t really want to be without it. It took a long time but it wasn’t like we had to kind of harass people, it was a fairly straightforward process. Once the process is complete that’s when the real benefits start to pay off because now we can add features that people want and we can deploy them very quickly across all services. We’ve been in this situation at Lyft for well over a year now, probably about a year and a half where we’ve been on full deployment and it’s been pretty amazing actually.
All right, so we’re going to shift gears now slightly because beyond how we actually deployed Envoy in terms of what kind of order we did things, the next major question that people often ask me is, how do we manage configs. That’s an interesting question that actually has also evolved over time. Like I took you through our deployment strategy, I’m going to also take you through how we’ve evolved our config management strategy. Envoy is … Envoy config is JSON. There’s a lot of complaining out there of JSON versus YAML. Let’s not talk about that.
What we originally had in the very early days is we would have our full JSON configs committed directly into our repo. This was very useful for us from a dev perspective because we didn’t need any back compat, we could obviously change things very quickly, we would have a fully spelled out config per deployment type, so we had a frontenvoy.json, a servicetoserviceenvoy.json. What we would do is when we deployed Envoy we would take the binary, we would build the binary, we would bundle it with the configs into a big tar ball, we would spray it out to all the places where Envoy was running and then our deploy process at Lyft is this pull based deploy process where we run salt on every host and salt would basically figure out how things have changed and then we would kind of unpack the new binary, unpack the new config and we would hot restart on to the new binary and new config.
At this point from a restarting and a config management perspective, binary deploys and config deploys are basically the same. They work the same. This worked very well for us, it was very simple, we could deploy very often, again without any worries of back compat. There’s obviously a small little snippet there of what an Envoy config looks like at a very high level. It became clear very, very quickly that the Envoy config has a lot of boilerplate. There’s a lot of things that are going to be duplicated and eventually there’s going to be a lot of dynamic input in terms of what upstream clusters is Envoy talking to and obviously all those clusters have to have a cluster definition and you have to have different listener definitions.
It became pretty clear very early that we were going to need some type of scripting to actually build these configs. What we did next is we built a tool conveniently called configgen.py and it uses a templating language, a Python templating language called Jinja and there’s a small snippet here. What Jinja allowed us to do is actually take these configs that were becoming hugely boilerplated, you know, it’s like a ton of duplication, and make them a lot more programmatic. Configgen would take a bunch of inputs, it would take a bunch of templates and there’s a template here and it would take the inputs, take the templates and then it would spit out the final configs.
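[Editor's illustration] The idea behind generating configs from inputs, rather than hand-writing each cluster definition, can be sketched as follows. To stay dependency-free this uses plain Python and the standard `json` module instead of Jinja, so it is a stand-in for the technique rather than a copy of configgen.py. The function names and the JSON field names are illustrative and are not guaranteed to match Envoy's config schema.

```python
import json


def cluster_definition(name, connect_timeout_ms=250):
    # The boilerplate every cluster shares is written once, here.
    return {
        "name": name,
        "connect_timeout_ms": connect_timeout_ms,
        "service_name": name,
    }


def generate_config(upstream_services):
    # Inputs (a list of service names) in, fully spelled out config out.
    return {"clusters": [cluster_definition(s) for s in sorted(upstream_services)]}


config = generate_config(["foo", "bar", "baz"])
print(json.dumps(config, indent=2))
```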
We actually still do this today. This has worked very, very well for us. The templates are pretty readable, the inputs are pretty readable, kind of the entire process is very easy to understand but at this point what we’re doing is we were still during kind of the Envoy deployed time, we still had this monolithic deploy where we would actually take configgen.py, we would build the Envoy code, we would take the templates, take the inputs, we would build the final configs, bundle them into this giant tar ball and then send the tar ball out and basically deploy it.
That worked fine actually for probably the first six to eight months that we were running Envoy everywhere. We ended up getting into a situation, which I’ll show you in this next slide, where the configs were getting to a point where every kind of service didn’t have to talk to every other service. Now we’re essentially blowing up these giant configurations and we are sending them everywhere, and from a performance perspective we have a bunch of Envoys that are potentially health checking and talking to other hosts that they don’t need to be talking to, and from a security perspective they’re opening up connection channels that are not actually necessary. From a performance, correctness and security perspective, it would obviously be a lot better if every host was not getting the same exact config.
Now we enter into what we’re actually doing right now, which is this hybrid system. In this hybrid system we take those Jinja JSON templates that I actually showed you. We actually still do our “front” envoy build process to rebuild the code but instead of actually generating all of the templates at initial build time, what we actually do is we now package configgen and the templates in that bundle that we send out to every host. Now what we deploy is we deploy the binary, we deploy configgen.py and we deploy the templates.
What that allows us to do is that on the upper right hand of this diagram, you can see that we have service manifests now. Every service at Lyft has a manifest that says things about it. For example, what its testing procedures are, like what its deploy procedures are. We also have a networking thing in the manifest. We make services put in the manifest the services that they talk to, databases that they talk to, caches that they talk to. Now what happens is that when we deploy Envoy, since we’re not deploying the final configs, we deploy the binary, we deploy configgen.py, we deploy the templates. Those templates go down onto the host where we actually take the manifest information on the host and when salt runs on the host we merge the manifest data with configgen.py and the templates and we produce the final config on the host that Envoy actually runs.
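[Editor's illustration] The on-host merge of manifest data with the config templates might be sketched like this, so that each host only gets clusters for the things its service declares. The manifest layout, the `dependencies` key and the function name are hypothetical; the talk does not show the manifest format.

```python
def render_host_config(manifest, known_clusters):
    """Build a per-host config containing only the clusters the manifest asks for."""
    wanted = manifest.get("dependencies", [])
    missing = [dep for dep in wanted if dep not in known_clusters]
    if missing:
        # Fail loudly rather than silently generating a config with gaps.
        raise ValueError("unknown dependencies: %s" % ", ".join(missing))
    return {
        "service": manifest["name"],
        "clusters": [known_clusters[dep] for dep in wanted],
    }


# Cluster definitions that would come from the shared templates.
known_clusters = {
    "foo": {"name": "foo"},
    "bar": {"name": "bar"},
    "baz": {"name": "baz"},
}
manifest = {"name": "rides", "dependencies": ["foo", "baz"]}
print(render_host_config(manifest, known_clusters))
```

Compared with shipping one giant config everywhere, the host running this service never health checks or opens connections to `bar`.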
Now, at this point again things still don’t require back compat and by that I mean we’re still deploying everything together. We’re still deploying the binary and the templates together so that when the Envoy binary gets on the host we don’t need to deal with a configuration that might be for some previous version. This is actually what we do now. The problems with this solution are obviously from a dev productivity standpoint. Lyft is growing to the size right now where we now have a bunch of config changes that are going out. These config changes might be related to edge routing. Like in that very first slide that I showed we’ve got the foo, bar, baz, right? Obviously at Lyft we have hundreds of routes now.
Those routes change a lot, and people bring up new services a lot. We actually have a bunch of template and input changes, and we’re getting to a point now where blocking the deploy of these types of config changes on binary deploys is actually becoming fairly problematic. We’ve known for a long time that we obviously have to break this dependency and Envoy has actually been built from the get go to break this dependency. Let’s talk about what we’re actually going to do there. Oops, okay.
Before we actually do that, let’s talk briefly about Envoy control plane APIs. From the get go, the goal of Envoy has always been to be a universal data plane, and by that I mean Envoy is going to bundle a lot of very complicated functionality from a forwarding perspective, different protocols, rate limiting, retries, buffering, et cetera, but we want it to be usable in a variety of different deployment types. We want it to be this universal data plane where we support different APIs that actually allow Envoy to be remotely controlled and configured.
For a long time we’ve actually supported various APIs, and the quote v1 APIs are JSON REST APIs, and we have four of them right now. We have the service discovery service, SDS. That’s actually fairly poorly named. That’s how Envoy dynamically discovers hosts in upstream clusters. We have the cluster discovery service, which actually allows Envoy to discover entire clusters. So like the foo service, the bar service, the baz service. We have the route discovery service. That allows Envoy to discover route tables from a forwarding perspective, so slash foo goes to foo service, slash bar goes to bar service, slash baz goes to baz service. We now have the listener discovery service, which actually allows Envoy to configure entire listeners and entire filter stacks.
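The kind of route table that the route discovery service delivers can be pictured with a small sketch. The table layout and the `pick_cluster` helper below are hypothetical and only illustrate prefix-based forwarding; they are not Envoy's actual route schema or matching code.

```python
from typing import Optional

# Hypothetical route table: (path prefix, upstream cluster name).
ROUTES = [("/foo", "foo"), ("/bar", "bar"), ("/baz", "baz")]


def pick_cluster(path: str, routes=ROUTES) -> Optional[str]:
    """Return the cluster of the first route whose prefix matches the path."""
    for prefix, cluster in routes:
        if path.startswith(prefix):
            return cluster
    return None  # no route matched


print(pick_cluster("/bar/users"))
```

The point of making this table discoverable is that adding a route becomes a data change served by a management server rather than a change baked into a deployed config file.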
With SDS, CDS, RDS and LDS it’s now actually technically possible with Envoy that you can have a very tiny bootstrap config and you can basically load everything from a remote management server. Envoy development has obviously gone ahead of what we’re doing at Lyft, and at Lyft currently, based on all of the previous slides that I’ve shown you, we’re only using SDS. We obviously dynamically fetch hosts from our discovery service but everything else is actually statically configured via this Jinja JSON templating process.
This is a diagram of our next generation config management system at Lyft and we’re actually, we’ve been developing this for a couple of months now. We’re about to go into deployment hopefully next week I’m guessing. What you’re seeing here is on the left side those are all Envoy components, so within an Envoy process, the cluster manager, the route manager and the listener manager. There’s a bunch of APIs that are actually in play here. The listener manager, it’s going to fetch entire listeners from the Envoy manager service via the listener discovery service API.
Once the listeners come down, they may refer to route tables. The route manager will fetch the route tables from the manager service via the route discovery service API. The cluster manager obviously may be needed to refer to clusters based on a particular route. Those will come from the cluster discovery service API. Then obviously we still have our legacy Python discovery service and in our v1 plan, our plan is to actually continue to use the legacy SDS API to fetch host information.
Now what you’re seeing is kind of a hybrid approach, and this is the way that I think almost all more sophisticated Envoy deployments will actually work in the future. We’ve still got our legacy registration cron job going into our legacy discovery service, which feeds SDS, but now instead of doing all this Jinja templating stuff that I was actually showing you before, we now have a static configuration repo, which is completely decoupled from Envoy. Our devs at Lyft can check in route changes, service config changes, and then we also have our service manifests. These both get fed into S3 via a deploy process and then our Envoy manager service will actually be fetching this information from S3, again it’s all eventually consistent, and creating the responses to these APIs on the fly, fully dynamically.
That means that an Envoy running on behalf of service foo can get an entirely different configuration from the Envoy manager service than an Envoy running on behalf of service bar or our edge Envoy fleet. Again at the cost of more complexity, this makes operations vastly simpler, because now we can basically deploy every Envoy with essentially a static bootstrap config. All the bootstrap config needs to know is how to contact the management servers, typically over DNS, and how to do various things around standing up the admin port, tracing, stats, things like that. I’m super excited about this. I think this is going to make dev productivity a lot faster at Lyft.
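A manager service that answers differently depending on which service an Envoy runs beside might look roughly like the sketch below. The `Manifest` class, the `talks_to` field and the response shape are assumptions made for illustration; the real Envoy manager service and the v1 CDS response format are not shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical service manifest: a name plus the services it calls."""
    name: str
    talks_to: list = field(default_factory=list)


def clusters_for(node_service: str, manifests: dict) -> list:
    """Build the cluster list for the Envoy running beside node_service."""
    manifest = manifests.get(node_service)
    if manifest is None:
        return []  # unknown service: serve no upstream clusters
    return [{"name": dep} for dep in sorted(manifest.talks_to)]


# In the design described above this data would be fetched from S3.
manifests = {
    "foo": Manifest("foo", ["bar", "baz"]),
    "bar": Manifest("bar", ["baz"]),
}
print(clusters_for("foo", manifests))
print(clusters_for("bar", manifests))
```

Since the response is computed per request from the manifests, the Envoy itself only needs a bootstrap config that says where the manager service lives.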
That’s kind of what we’re working on now. I want to just briefly touch on the future of these APIs. We’ve been doing a ton of work actually on our quote v2 APIs. Like I was saying, the v1 APIs are all pure JSON APIs. The v2 APIs are actually going to be gRPC APIs. We’ll also support JSON, but the idea behind them being gRPC APIs is that we’ll get bidirectional streaming and actually in certain cases more features, but in general it’s the same spirit as our v1 APIs. So kind of briefly going through what we’re planning for these v2 APIs: we’re renaming the SDS API to the EDS API, the endpoint discovery service, since that’s a better name. That will be how we will fetch host information but we’re also actually making this a little bit more of a robust API.
In addition to actually fetching host information, Envoy will be able to report host info. It will be able to report things like CPU load, memory usage, things like that, and that will potentially allow a very sophisticated management server to actually use load information dynamically to determine assignments, which is actually very interesting. This kind of moves Envoy into again that universal data plane where we’ll actually be doing more of a global load balancing system. We still have our cluster discovery service, that’s about the same. We still have the route discovery service, same, listener discovery service, same.
We’re adding a new health discovery service API. This is actually going to allow Envoy to be used as part of a centralized health checking mesh. A central health checker can actually assign Envoys to health check a subset of endpoints and report health checking information back. Again, as part of this theoretical global load balancing system, we can now break this N-squared health checking kind of problem, where a subset of Envoys can health check and send that information back to a global manager.
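The N-squared point can be made concrete with a small sketch. The round-robin assignment below is a hypothetical policy chosen for illustration, not how the health discovery service actually assigns work; the names `assign_health_checks` and `redundancy` are assumptions.

```python
def assign_health_checks(checkers, endpoints, redundancy=2):
    """Round-robin each endpoint onto `redundancy` distinct checkers."""
    if redundancy > len(checkers):
        raise ValueError("redundancy cannot exceed the number of checkers")
    assignments = {c: [] for c in checkers}
    for i, endpoint in enumerate(endpoints):
        for r in range(redundancy):
            checker = checkers[(i + r) % len(checkers)]
            assignments[checker].append(endpoint)
    return assignments


checkers = [f"envoy-{i}" for i in range(4)]
endpoints = [f"10.0.0.{i}" for i in range(8)]
plan = assign_health_checks(checkers, endpoints)

full_mesh_checks = len(checkers) * len(endpoints)      # every Envoy checks everything
assigned_checks = sum(len(v) for v in plan.values())   # each endpoint checked twice
print(full_mesh_checks, assigned_checks)
```

With every Envoy checking every endpoint the work grows with the product of the two counts; with central assignment it grows with the number of endpoints times a fixed redundancy factor.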
Finally, we’re going to implement what we’re calling ADS, which is the aggregated discovery service. This is really just an ability, if a management server desires, to get all of the APIs flowing on a single H2 stream. That basically means that one Envoy is always talking to one management server over one H2 stream and updates can actually be ordered. This means that a management server, if it desires, can actually sequence updates in an order that will mean that Envoy will not drop traffic in certain cases. For example, clusters could be sent, followed by endpoints, followed by a route table update that will actually use those clusters and endpoints. That will mean that when the atomic switch happens, no traffic is dropped.
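The ordering idea can be sketched as a simple sort over pending updates. The update dictionaries and the `APPLY_ORDER` table are hypothetical; this only illustrates sending dependencies before the things that refer to them, not the actual ADS protocol.

```python
# Hypothetical ordering: dependencies first, so a route never refers to a
# cluster or endpoint set that Envoy has not received yet.
APPLY_ORDER = {"cluster": 0, "endpoint": 1, "route": 2}


def sequence_updates(updates):
    """Sort pending updates so clusters precede endpoints, which precede routes."""
    return sorted(updates, key=lambda u: APPLY_ORDER[u["kind"]])


pending = [
    {"kind": "route", "name": "/foo -> foo_v2"},
    {"kind": "endpoint", "name": "foo_v2 hosts"},
    {"kind": "cluster", "name": "foo_v2"},
]
ordered = sequence_updates(pending)
print([u["kind"] for u in ordered])
```

On separate streams these three updates could arrive in any order; a single ordered stream is what lets the management server enforce the sequence.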
What you’re seeing here, which I think is kind of interesting is that again we are moving to a world where there’s a tremendous amount of development being done on Envoy. I feel very hopeful that Envoy is going to get wide deployment kind of as this universal data plane. But what I think we’re going to see, which is really interesting is we’re going to see providers come, I think with probably proprietary management servers that some of them will do some very powerful and very cool things like around global load balancing, centralized health checking, things like that. That’s kind of where I see things going and I think we’ll see a lot of this over the next six to twelve months, which is fairly exciting.
Briefly going to talk about Istio and just in terms of how it relates to what we’ve been talking about. Like I’ve been saying to different folks and I was saying on Twitter recently, what we have today at Lyft in terms of how we deploy Envoy, if I were to start from scratch today, I probably wouldn’t do what we have, right. Like I wouldn’t have this Jinja JSON system with static configurations, but what I hope that you can see from this presentation is that that’s just the way things develop. We have gotten a lot done in a very short period of time and sometimes that requires cutting corners.
What excites me about Istio though is that Envoy is this data plane but we also have this control plane need. At Lyft currently our quote control plane right now that we have in production is again very simple. It’s a bunch of Python, like a bunch of Jinja, it’s a bunch of salt and we’re obviously moving to a much more sophisticated control plane but what’s exciting about Istio is Istio is really a decoupling of the control plane from that data plane. Istio is an entire project that is focusing on building this universal control plane. For me that’s super exciting because there is so much work that has to go into making a control plane robust and have proper security and roll back and roll forward and versioning. I mean, there’s just so much to it that I think there are just two different layers and it’s important to decouple that control plane layer from that data plane layer.
I think for a lot of folks particularly people that are excited about service mesh and kind of are excited about using Envoy, I think Istio is going to be a vehicle that gets people into using this because it’s going to bring Envoy and a lot of the configuration complexity around using Envoy I think to a much broader set of people. Again, this picture here is obviously showing kind of what I was just talking about where Istio control plane basically implements all those Envoy APIs. It implements SDS, it implements CDS, it implements RDS and ironically or maybe not Istio has consumed some of those Envoy APIs sooner than Lyft has actually consumed them but that’s just the way things work when you’re running an existing system.
I am very excited about what happens going forward. I’m also excited just in general about this whole movement towards these management servers, and like I was saying I think some of them are going to be proprietary but I think there’s going to be open source ones too, including Istio. From a Lyft perspective I think we’re going to have to figure out how we treat the management system that we are building currently. I think our plan is to open source it but as you pop higher in the stack, these systems tend to get more and more domain specific. They tend to be built into Lyft infrastructure, or right now Istio is very bolted into Kubernetes, though they’re kind of working on other stuff.
It gets harder and harder to have the right abstractions in place but I am very excited about that. Anyway, that is all I had right now. Thank you very much for coming and listening. You can reach me on Twitter. For those of you that are new to Envoy we’re very excited about building a larger community, so definitely reach out. There’s the link there for the web page for Envoy proxy and the Twitter account as well as Istio and thank you. I’ll be happy to take some questions.
Kelsey Evans: All right, awesome. Thanks so much Matt. We’ve had a few questions come in already. If you have questions please feel free to go ahead and put those into the Q and A or the chat and we will get started. So the first question is, at Lyft are the developers who are responsible for a given service also responsible for the Envoy configuration?
Matt Klein: The answer is a little. What we do at Lyft, which I highly recommend that most people do, is we provide defaults, and we are fairly strict about defaults. We allow developers to tune certain portions of the config but only a very small portion. The networking team basically builds the templates and we control almost everything of what people do, but we allow them to customize obviously who they’re talking to, and we also allow them to customize things like timeouts, rate limit settings, circuit breaking settings, things like that.
That kind of gives us the best of both worlds where we control most things to make sure that people are doing the same stuff. We allow people to customize things but we also actually force people, I mean force teams, to understand basic networking concepts. We actually, you know, we force them for example to run redline tests and to run fault injection tests and to actually know where their breaking limits are, and to actually set their circuit breaking settings appropriately. The answer is both. Like we require people to understand a little bit, to kind of know what’s going on and to set certain things, but the networking team controls most of the configuration.
Kelsey Evans: Okay. What’s the network latency overhead introduced by these Envoy proxies?
Matt Klein: It’s, that, it’s a very common question that’s asked. It’s hard to answer because it really depends on what Envoy is doing. Like Envoy can be configured to do a whole bunch of stuff that ranges from barely anything to tracing, logging a bunch of stats. The ballpark number that I typically give people is probably around one millisecond per hop but that’s a very hand-waving number and I would encourage you to do your own performance investigations.
Kelsey Evans: Okay. You mentioned that you didn’t like the static and templated configuration approach and you wouldn’t use it again if you started today. Can you talk a little bit more about that?
Matt Klein: Sure. Well, it’s not that I wouldn’t use it, so that was maybe poorly said. I wouldn’t use it right now with where Lyft is at today. I would have done the same thing that we did back then, mainly because it’s a super pragmatic approach. Like it’s very easy to understand, it was very easy to deploy. Right now we are having scaling problems from a developer productivity perspective, where making changes to that static config and actually getting them out with our binary and deploy process is actually holding people up. If I were starting today, I would probably have gone directly to the next gen manager system that we’re actually building now, but I don’t know that it would have changed anything from what we did when we originally deployed Envoy with the number of services that we actually had. I still think it’s a pretty pragmatic approach.
Kelsey Evans: Okay. Is envoy being donated to the CNCF?
Matt Klein: I can’t promise anything but I can say that we are looking into that fairly heavily.
Kelsey Evans: How do you see your closed architecture, AWS infra, versus open Kubernetes and Envoy tools coexisting going forward?
Matt Klein: I’m going to see if I can understand that question. I’m thinking the question is like how will we consume open source technologies versus kind of what we’re doing today. We run in the cloud at Lyft so there’s no way around us having to use certain proprietary technologies. As it is today, without going into a huge long conversation, there’s really no way to get rid of the AWS TCP ELBs at the edge. Like we pretty much have to use them.
From a networking perspective we are already trying to remove as much proprietary technology as we can, mostly because it’s easier to manage from an operations perspective. We are also starting to heavily look into Kubernetes. I think that’s going to be a multi-year effort for us, like you’re not going to take an organization like Lyft and move them to Kubernetes in a month. Like it’s a big process but over time we would like to get rid of our bespoke systems and we’d like to converge on more industry best practices. Ultimately long term I would love to see Lyft using Istio but it’s going to take us a while to actually get there.
Kelsey Evans: Okay. The Istio slide mentioned full transparency to the application. Does that mean the thin client library won’t be needed?
Matt Klein: That’s, kind of what I was talking about before is there is no way around the library if you need to propagate IDs which you’ll need to do if you’re doing distributed tracing. It’s just the fact of life. I think as an industry one of the poorest things that we’ve done is actually educate and help people on how to do all this propagation work.
I’ve been chatting with a bunch of people lately and I’m really hoping that we can get kind of some open source propagation libraries in place that are not even tracing specific, just because it’s a common concern: you’ve got a request coming in on one side, or you’re doing some work on one side, and you need to propagate some state to some other side. There’s common patterns that you use in each language. Like in Go whether it’s context, or in Python whether it’s the gevent context or thread local storage or something like that.
I think there’s no way around this thin library, and I think that depending on how Envoy is deployed, particularly with use of iptables, certain things can be made more or less transparent. I do think that the library has benefits in terms of, like I was saying, allowing developers easier access to certain features. Like there’s a bunch of things in Envoy that can be overridden via headers and that’s actually super convenient. It doesn’t have to be part of the route config, so it allows developers to dynamically tune things. I do think that the library will never go away. There’s just no technical way to do it, but I think that we can do a better job of making that library more accessible to people.
Kelsey Evans: Great. Okay, that looks like all the questions that have come in. If you have a question that you didn’t get a chance to ask, you can post it in the summit Gitter, which is gitter.im/datawire/summit, and we can make sure that question gets answered. With that we will wrap up the virtual summit, so thank you Matt and thank you to all of our other speakers and all of our attendees for listening. You’ll get an email this afternoon with the recording for Matt’s talk, and all of the other talks are now available on microservices.com/talks. Feel free to check those out and we hope to see you at our future summits, so thanks Matt.
Matt Klein: Thank you, bye.
Kelsey Evans: Bye.