What is a Service Mesh, and Do I Need One When Developing Microservices?

Daniel Bryant


Daniel Bryant: Hello everyone. Welcome back. I hope everyone is suitably caffeinated. I appreciate lunch is next as well, so I'm going to try and make this as interesting as possible. So, we're going to be learning today what is a service mesh and do I need one? Now, a few of you may notice I've actually stuck in the cloud native bit there. The original title didn't have the cloud native, but I think we've been hearing quite a bit about cloud native over the last couple of days, and it's super important.

I think the expectations and reality are somewhat different between what we're sort of pitched, as in this is the perfect version of cloud native, and what we as engineers have to put up with. The pitch is we carve up our applications, loosely coupled SOA, bounded contexts, put it in containers, ship it, and we're all good to go. Yeah. Cloud native in a nutshell.

The problem is many of us have legacy apps or just apps, moneymaking apps. They are often a bit of a tire fire. The snag is when you put tire fires in a container, what you actually get is literally tire fires in a container. You'll notice the firefighters in the bottom there, this is what we call DevOps in our industry. This is how I make my money as a consultant here in Germany, in the U.K. and also in the U.S. as well.

So, the key takeaways today. I like to prime people with what we're going to be learning. Fundamentally, we're going to be looking at a service mesh. It is a dedicated infrastructure layer for making, at the moment, service to service communication safe, reliable, observable, and configurable.

There's also going to be some messaging stuff coming, people looking at Kafka. Service meshes are also sort of encroaching on Kafka space now or rather not encroaching, but being combined with Kafka. So, not only can you observe service to service, but you can observe messaging in your infrastructure in general.

I think this kind of thing is valuable as we move from the deployment of complicated systems (I'll explain more about what I mean by the word complicated), so things like monoliths, classic servers, that kind of thing, to complex cloud native systems: micro services, Serverless, FaaS, all this good stuff.

We heard it last night, if anyone came along to the talk last night with the panel last night, be very careful. This is new technology. Owen who did the talk before me said the same thing. We are exploring the space. There's a lot of pain potentially with micro services particularly around communication, hence why service meshes are being created, but this is a super new space. Not much of the technology is production-ready.

I've worked with clients that are using it in production, but they're taking the risks on board knowing that this stuff, other than Linkerd, a lot of the technology you'll see today is not really battle-tested and this can be quite a challenge.

This is me. I love getting involved on the Twitter as @danielbryantuk. I work as an independent technology consultant. Behind the scenes, I am writing a book on Continuous Delivery in Java. That one has been going for a while. I think last year at this conference, I talked about that. I'm still working on it, but writing a book is hard. I'll tell you that. It's a lot of fun, but it's hard.

So, I work on the tech side and the organizational side. I think last year I actually did an organizational talk, but this one is going to be very much tech-focused. Don't get me wrong. As you hopefully heard from all of us at the panel last night, the organizational stuff is probably more important than the technology. I'll say it again. The culture, the organizational stuff is probably more important than the technology, but today, this is a tech-focused talk. All right, the caveats have been said.

Setting the scene. So, we've moved. I started my career in the '90s roughly. I know we've been developing software since probably the late '50s or '60s, but I started my career then. It was monoliths and single language, these kind of things. Cloud didn't really exist back then and we optimized for stability. As we moved through to the 2000s, famously in 2006, Amazon released their S3 and SQS and the cloud has just grown from there.

I was, at this time, a full stack developer because I did Java and JavaScript. I was full stack. That term has evolved even more now, but we had very much a focus on monoliths, coarse-grained SOA and smart pipes like ESBs and MQs. Are there people who work with enterprise service buses in the audience? I appreciate I might be a little bit out. Not everyone is ... Yeah, a few hands. Great stuff. So, I'll cover some of that later on as well.

The 2010s have been about making everything smaller. Fundamentally, with micro services, it is just about modularization. A fantastic paper from David Parnas, written in 1972, talked about modules. You can literally read the paper and replace in your mind the word module with micro service and everything is good. It's about cohesion. Chris talks about this a lot. Yeah, Chris Richardson. It's about cohesion. It's about coupling, but we are using the network boundaries to enforce those principles. We're creating network boundaries around our services.

We're moving more towards dumb pipes, so REST, HTTP, decentralized communication. We're bringing in many different languages, not just Java, not just JavaScript. There's often Go and Ruby, and other things now, and everything is software defined. Yeah. For better or worse, I used to rack and stack servers. Now I do everything via an API, via Terraform, and all these kind of things.

Accordingly, it's making our teams change to more cross functional general-purpose teams. [inaudible 00:05:54] called Findev, kind of business aware developers. Google's SRE, Site Reliability Engineers, in the middle, and then platform teams running things like Kubernetes, ECS, Mesos, that kind of thing.

Now, as we move through these things, I think and I'm borrowing from the Cynefin model, if anyone is interested the Cynefin model is a model of how we as humans deal with complexity. I think comparatively speaking, we've moved from a relatively simpler era and I'm not saying it was simple, but in the grand scheme of how we as developers, how we as architects build systems, we've moved from simple to kind of complicated and ultimately to complex. The jump between complicated and complex is very big.

Complicated system, you get cause-and-effect and it's very easy-ish to reason about them. Simple systems are literally just simple. The jump between complicated, where we have cause-and-effect to complex is big in that you have emergent systems. There's many more components involved, things are moving at different paces. As humans, our cognitive abilities, it's hard to understand what's going on sometimes particularly with things like database triggers, with Serverless, all these kind of things.

So, be aware of that jump. It's quite big. We're trying very much to avoid the fourth area in Cynefin, which is chaotic. I encourage you, if you're keen to understand more about what we call sense making, how we as humans understand and model complexity, the Cynefin framework, it's [inaudible 00:07:28], is well worth looking into. But I think we are very much here, and I'm kind of wrapping all these things up, and many of us have talked already at the conference about this as being kind of the "cloud native" space. I think it's a useful word.

I think that sometimes that micro service DevOps cloud native it's very easy to misunderstand the word. Unfortunately, when a word gets popularized, it gets abused by vendors. No, just all of us to be honest, but I like the term cloud native for bundling a bunch of things up. I think what we're going to focus on today is the dumb pipes and the decentralized communication.

Has anyone heard of the eight fallacies of distributed computing? Yeah. A few, awesome. So this totally got me like ... and I think it's also the eight fallacies of cloud native. When I first started developing on Amazon and deploying on Amazon's cloud, I found out all these things. It was a nice paper. I've referenced some stuff down at the bottom there. You can read more about these eight fallacies, but I think all of us as engineers, particularly if you've come from the monolithic era, maybe using PaaS or deploying on infrastructure that you own, can very easily be burned by these kind of things.

In fact, as a consultant, the most common kind of response I see is when I start pointing these things out to developers. I get called in. Something works very differently in dev than it does in production, and no one is quite sure why. I often say, everything in Amazon and any cloud is over the network. Block stores, databases, everything is going over the network. The general response is kind of, yeah, that'll be fine. That'll be fine. Yeah, yeah, while in reality you have to code, or at least set up your infrastructure, in a way that can prevent some of these problems.

So, what do cloud native comms look like? In cloud native communication, I'm a big fan of Christian Posta's work. Christian comes from the Java EE, ESB, MQ days. He's worked for Red Hat and they work for Netflix. They work for a bunch of different companies. He's got a perspective on classic technologies and more modern technologies. So, I'm a big fan of Christian's work.

He says that in a cloud native world, services communicate over an unreliable network. Anyone who's used Amazon, Azure, Google for any period of time, you just get split brain scenarios. You get networks dying. You get latency. These kind of things. Interactions are nontrivial. We've got micro services. We've got Serverless and frameworks. How they are interacting is not always trivial. Sorry. It's nontrivial.

But there's a lot of value in understanding the network, both as a static snapshot, how all of my components link up together, and dynamically, how the network behaves at runtime. We're quite good as architects at understanding the static snapshot of the network, but the dynamic part is where we're not quite so good yet.

The application is ultimately responsible for handling any problems. So, if something goes wrong in the communication, you can do things at the infrastructure level, but ultimately your application has to deliver the user experience. If something goes wrong, do you hide it from the user? Do you show a stack trace? All these different kind of things.

So, we've been here before though. These are not kind of [inaudible 00:10:53] popped up. People have tried to do RPC with queues. I don't know if anyone has tried to do that back in the day. We tried to make our communication more robust, but then we introduced latency, and these things were hard to understand. We brought in these magical enterprise service buses. This is Christian's work. Actually, Christian's great graphics here.

The problem with the enterprise service bus is not only was there a sort of single point of failure, but we also put business logic in there. They become kind of smart pipes. When we were deploying services, we often had to deploy our code and something to the ESB and that coupling is bad. It reduces our flexibility.

Also, we're seeing the same happens with API gateways. This is a presentation I did a couple of years ago, the deadly sins of micro services. I've seen people try and put API gateways as the backbone for their communication framework within the network and a bunch of other things.

Basically, if anyone has used Zuul, Netflix's tech. Zuul is awesome, but it's very easy to inject dynamic code in Zuul, and then you have that problem again where you've got some business logic in the gateway and some business logic in your code. The more the business logic is spread out, the harder it is to work with. So, be cautious of these kind of things.

One thing I would mention is that service meshes are very much focused on what we call the east-west traffic. The traffic within your data center, the traffic within your services. Things like API gateways are focused on what we call the north-south traffic, the ingress and egress. The ingress into your application. Egress maybe to external services. So, service meshes are focused purely on the east-west, purely within the boundary, at least at the moment they are anyway.

So I, like many people, have worked with ESBs. We were super keen when we started developing micro services to not reinvent the ESB pattern and not use any ESB. There are some good ESBs, and MuleSoft do have a very nice open source lightweight ESB, but some of the vendors, I will not name because I'm being recorded, but some of the vendors do quite heavyweight ESBs that maybe lock you in. So I was very keen. My team and I were very keen to avoid these kind of things, but we had a bunch of problems.

The first one with our micro services, this is back in 2014, was service discovery. With service discovery, we iterated, but eventually we used something like SmartStack. I don't know if anyone has bumped into SmartStack by Airbnb. You kind of have your application here. You're running Nerve. It's almost like a sidecar type of process, or a process on the EC2 instance.

Your application used to link up with Nerve and Synapse to register where it was located, where on the network it was located, and then this application could use Synapse to discover, oh, here is how I talk. Talk via local HAProxy, in this case. We used NGINX I think at the time and then we used a bunch of other tools as well. We had our application talking via the proxy. So, this is like an EC2 instance here. Apologies over this side.

So, the box is like an EC2 instance there. We were talking via an HAProxy or NGINX process that was on the same EC2. We had many applications talking via the HAProxy, the NGINX. If the service we wanted to talk to was on the same EC2 instance, it was smart enough to use the Loopback Adapter and you could just use localhost. It was quite a nice bit of kit.
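As a rough illustration of the register-then-discover flow just described, here is a minimal in-memory sketch. It is not how Nerve or Synapse are implemented; the `ServiceRegistry` class, its method names, and the addresses are all hypothetical, and the only behaviour taken from the talk is that a caller on the same host is pointed at loopback.

```python
from typing import Dict, List, Tuple

Address = Tuple[str, int]


class ServiceRegistry:
    """Toy registry: services register where they live, callers look them up."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[Address]] = {}

    def register(self, service: str, host: str, port: int) -> None:
        # Record one running instance of a service (hypothetical API).
        self._instances.setdefault(service, []).append((host, port))

    def lookup(self, service: str, caller_host: str) -> Address:
        instances = self._instances.get(service, [])
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        # Prefer an instance on the caller's own host and reach it over loopback.
        for host, port in instances:
            if host == caller_host:
                return ("127.0.0.1", port)
        # Otherwise fall back to the first known remote instance.
        return instances[0]
```

A real setup would also health-check instances and remove dead ones, which this sketch leaves out.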

Our approach to fault tolerance was pretty much there. Time and again we got burned. We were primarily a Java stack. We were using Spring at the time, so we leveraged Netflix OSS. This is before it became Spring Cloud, but we used Hystrix and we started looking at things like Ribbon and it was great. That project was pretty good, but then we moved on to another project where we were using Java and Ruby, and the problem was all the Netflix stuff is library based.

So, the Java implementation of fault tolerance was quite different than the Ruby implementation of fault tolerance. If you all lean on Java or all lean on Ruby or whatever, picking a library is great, but if you're using polyglot languages, which more and more of us are, this is a big challenge. This is why with service meshes, we're trying to push some of the functionality from the likes of Netflix OSS, which is great tech, but remember it's at least five years old now. We always hear the echoes of where Netflix have been. It's great tech, but we're trying to push some of this functionality down into the fabric, down into Kubernetes, these kind of things.

Phil Calçado, I think, has explained this really well. He talks about how we put a lot of our communication logic into our application itself. We're using libraries like the Netflix stack and we keep the networking stack, our HTTP drivers, our TCP drivers separate, but we're baking a lot of these things in and it causes problems when we need to update these things or when we're using different languages. We want to pull them out. We want to have our services completely separate from, say, circuit breaking and service discovery, these kind of things.

Ultimately, this is what we've got with things like Envoy and Istio and so forth. We have this sidecar model. A sidecar, in general, is just a separate process that runs alongside your application. In the Kubernetes world, it's a pod and you have multiple containers, but we've run sidecars on EC2 and so forth. It's just a separate process that runs beside your main application.

We're assuming that the communication is secure within this boundary. So, my Service A wants to talk to B. It does so via the sidecar, but this communication hop here we're assuming is secure. There is some stuff you can do. You can encrypt when you're going over localhost and stuff if you want to, but quite often, we do make the assumption that this bit here is secure and this is an untrusted network in the middle, so we communicate using sidecars. I'll explain more about what sidecars can do in a minute, but that's the fundamental pattern of service meshes.

NGINX have been talking about this for a long time. I actually first bumped into what I believe they called the fabric model several years ago. The snag was you had to use NGINX Plus, the commercial offering, which is great. Several clients I worked with did use NGINX Plus, but others didn't want to. It was quite tricky to recreate some of the service mesh functionality with the open source NGINX, but this is pretty much what we're calling the mesh.

The mesh is the thing between the services. Here we have like services talking to an NGINX Plus, that's our sidecar proxy. The actual proxy, the sidecar, is talking to the other sidecars and this forms the mesh. That makes sense. Christian Posta, similar kind of thing. The mesh in his world is that he has services with sidecars or daemon processes. Again, the mesh is this nebulous thing that sits in the middle and handles things like service discovery, like fault tolerance, all those things.

If you're familiar with say the MuleSoft model of how you implement micro services, MuleSoft have different layers within their micro service stack. You can pretty much say that the mesh is the bit in between the layers, but they have some [inaudible 00:18:19] say ingress as services that talk to, I can't remember the name now, but it talk to the services, and then they have process or backend facades on top of legacy APIs. So, the mesh is going to pretty much, in this model, sit in between all those services interacting. As a request comes in, it would go down through the stack, and the service meshes would handle various bits and pieces.

If you're familiar with the OSI model, it's pretty much operating at layer five and six. It does touch layer seven and it does touch layer four for various reasons, but just bear in mind that layer seven is where we are coding. So, we can only make application or business decisions typically at layer seven. So, a lot of service mesh technology at the moment does not interact with layer seven. There are good reasons for that, because if you want to interact at the business level, you need a library. We're back to the same problem with the Netflix stuff.

Now, people are saying there will probably be lightweight libraries coming in for the likes of Istio and Envoy, because there's a lot of advantage to being able to control what happens at this level. If something breaks lower down the stack, ultimately you can show a failure page or whatever, but your application might know how to better handle that. It might back off or it might show an error page. You, or we I should say, as developers and engineers, can make a decision at layer seven. So just bear that stuff in mind.

In terms of the actual service mesh features, I'll run through the high level then go into perhaps more depth. I do appreciate there's probably quite a lot of information coming at you. I will post the slides later, with a whole bunch of links. This is a little bit of a brain dump presentation to make you aware of the good and the bad, which in general is how I work with my presentations. But the first thing in service meshes is you can normalize naming and logical routing. So you can create an immutable artifact. Liz talked about this in her talk.

The first thing, you create your application, you reference say, the user service there. Then you can deploy that application, that immutable container in every different environment you have, dev, QA, in production. The service mesh can rewrite appropriately. The service mesh in each environment can make sure you map what we call user service to an operational location, to an IP address.
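To make the naming idea concrete, here is a small sketch of mapping one logical name to a different operational location in each environment. The `ROUTES` table, the `resolve` function, and the IP addresses are invented for illustration; a real mesh keeps this mapping up to date dynamically rather than in a static dictionary.

```python
from typing import Dict, Tuple

# Hypothetical per-environment routing tables (addresses are made up).
ROUTES: Dict[str, Dict[str, Tuple[str, int]]] = {
    "dev": {"user-service": ("10.0.0.5", 8080)},
    "qa": {"user-service": ("10.1.0.5", 8080)},
    "production": {"user-service": ("10.2.0.5", 8080)},
}


def resolve(environment: str, logical_name: str) -> Tuple[str, int]:
    """Map a logical service name to a concrete (ip, port) for one environment."""
    try:
        return ROUTES[environment][logical_name]
    except KeyError:
        raise LookupError(
            f"no route for {logical_name!r} in {environment!r}"
        ) from None
```

The application only ever refers to `"user-service"`, so the same immutable container can be deployed to dev, QA, and production unchanged.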

You can do traffic shaping and shifting, things like load-balancing. We can hide services. We deploy them, but we don't release them, for example. We can do something that's supercool, which is per-request routing. So we can pick out high priority customers. We can pick out us doing test runs in production and we can shift and shape the traffic accordingly. I got through my notes.

We can add baseline reliability. As I mentioned, when you hear circuit breaking in this context, it's not Hystrix. It's circuit breaking at the layer four, layer three-level. We just stop overloading a service if it's not responding. We stop overloading a service if it's doing a Java garbage collection or that kind of thing.
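A minimal sketch of this kind of network-level circuit breaking follows, assuming a simple policy of "open after N consecutive failures, allow a trial request after a cool-down". The class name, thresholds, and injectable `clock` are assumptions made for illustration, not the behaviour of any particular mesh.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop sending traffic to a backend after repeated consecutive failures."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial request
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only allow a trial request once the cool-down has passed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Note that nothing here knows about business-level fallbacks; it only decides whether to send traffic at all.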

We can increase security. One of the things I often wanted in the code I created four years ago, three years ago, was the ability to have mutual TLS. I have all my applications running in say Azure and Amazon and I want the services within each of those to communicate over mutual TLS. Not just client server authentication and authorization, I want to do mutual TLS so both parties know who they're talking to and can prove it cryptographically. This was hard until we had service meshes.

As I mentioned before, understanding the dynamic properties is really hard in a micro service system, in a cloud native system. Services are being deployed, upgraded. Instances are dying. We're rescheduling containers. Stuff changes all the time and we as engineers need to understand what's happening. We need top-line metrics like success rates and we probably need some form of distributed tracing. We need to understand when a request comes in at ingress, how is it handled down through the stack? Is there one micro service that's performing really badly, for example?

Something I realized, so I always check to a bunch of people at the end to help me with this talk because as much as it's for me presenting and it almost takes a clearly mind, a bunch of people were generous and lent their time to help me understand service meshes about nine months ago. Matt Klein, who created Envoy and works at Lyft, said that at Lyft they use Envoy as their service mesh to provide sane defaults.

So anyone can push anything to production, but it always communicates through the service mesh. If you suddenly go crazy with your traffic, the service mesh will shut you down. If you're causing lots of problems in other services, the service mesh will shut you down. I thought that's a very nice way that you can clearly override it if what you're trying to do is the right thing, but he was using or the team in Lyft are using the service mesh to control sensible defaults when new things get deployed to production.

Lyft have been using Envoy in production there I think for a year or two and at crazy volumes. For people who aren't familiar with Lyft, it's an Uber competitor and Lyft are based primarily in the U.S. So I often use Lyft when I'm out in the U.S. So it's very much a car hailing app, but they have crazy volumes of traffic. Their app is very chatty, for example, so they have proved a lot of these concepts in production, which is very nice.

Diving in, in a little more depth now to each of those things, first is the naming. As I hinted at, you can have logical names over here, so the user service, the checkout service. As things get pushed down through the stack, our service mesh can map it appropriately in real time as our network topology is changing, as new containers are being spun up and shut down. Owen mentioned this in his talk. It is super hard sometimes. It's so dynamic. It's super hard to understand what's up and what's down. So we as humans don't want to do that. We want to offload it to something like a service mesh.

Once you do offload these things to a service mesh, you can do smart load balancing internally. So, a lot of load balancers, as far as I know, and I appreciate these things keep changing, but ELBs and NLBs and CLBs in Amazon, are quite dumb in the way they do internal load-balancing. With things like Linkerd, you can actually do weighting in much more depth. You can weight, you can look at latencies, you can do all these clever things, and it's a bit more in tune with things like Kubernetes and the network fabric itself.

Amazon is sometimes the lowest common denominator. I love Amazon stuff. It's awesome of course and I'm sure they're working on technology just like this, but at the moment, Linkerd gives you more internal load-balancing options. So if you're working on a large-scale or high-performance system, first ask yourself if a service mesh is right for you, because they do add more network hops, but things like Linkerd can give you a slightly more performant experience than if you're trying to do some of the other things manually.
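As one way to picture latency-aware load balancing, the sketch below keeps an exponentially weighted moving average (EWMA) of observed latency per backend and sends the next request to the backend with the lowest average. This is a deliberately simplified illustration, not Linkerd's actual algorithm; the class name and the `alpha` smoothing factor are assumptions.

```python
from typing import Dict, Iterable, Optional


class LatencyAwareBalancer:
    """Pick the backend with the lowest moving-average latency."""

    def __init__(self, backends: Iterable[str], alpha: float = 0.3) -> None:
        self.alpha = alpha  # weight given to the newest observation
        self.ewma: Dict[str, Optional[float]] = {b: None for b in backends}

    def record(self, backend: str, latency_ms: float) -> None:
        previous = self.ewma[backend]
        if previous is None:
            self.ewma[backend] = latency_ms
        else:
            self.ewma[backend] = (
                self.alpha * latency_ms + (1 - self.alpha) * previous
            )

    def pick(self) -> str:
        # Backends with no observations yet sort first so they get tried.
        def score(backend: str) -> float:
            value = self.ewma[backend]
            return -1.0 if value is None else value

        return min(self.ewma, key=score)
```

Always picking the single fastest backend can cause herding under load, which is one reason production balancers add randomisation on top of a signal like this.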

So yeah, I've just talked about this here. As it goes down through the stack, I'm looking up my user service due dotcom and it gets mapped down to say the host and the port and then the IP, for example. We can do traffic control. So for example here, I think I've labeled them. I've got Service A trying to talk to Service B, but we've deployed a new Service B, an updated Service B.

Now, imagine this whole thing is running in Kubernetes or ECS or something like that. At the moment Envoy is primarily ... Istio is primarily skewed towards Kubernetes, but in theory you could run this anywhere in the future. It will be like that, and we want to do traffic splitting. We want to do a canary release. We want to have 99% of our traffic going to our battle-tested service and we want 1% going to our new service.

Well, we can do that via what's called the control plane. All of us interact with control planes all the time without realizing it, but it's the mechanism, the UI, the brains, that interacts with what we call the data plane, which does the actual work. So, the control plane is our mechanism for interacting with the system. Ultimately, we specify a bunch of stuff. For example, in some stacks, we say, here I should have changed the weight to 90 and 10, via istioctl, for example, on my command line, or maybe via a UI.

Pilot passes it down to Envoy. When the services interact with Envoy, it bears these things in mind and splits the traffic accordingly so you can do canary testing. You need to be monitoring both services and both responses to see if your new service is doing good things. Yeah, that's just Envoy there.
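The weighted split itself is simple to sketch. The function below picks a service version in proportion to configured weights, for example 99 to 1 for a canary. The function name, the version labels, and the use of Python's `random.choices` are illustrative assumptions; this is not how Envoy or Pilot implement it.

```python
import random
from typing import Dict


def pick_version(weights: Dict[str, float], rng=random) -> str:
    """Choose a version label with probability proportional to its weight."""
    versions = list(weights)
    # random.choices draws with replacement according to relative weights.
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


# Example canary configuration: 99% to the stable version, 1% to the new one.
CANARY_WEIGHTS = {"service-b-v1": 99, "service-b-v2": 1}
```

Calling `pick_version(CANARY_WEIGHTS)` once per request gives roughly the configured split over many requests, which is why you need enough traffic for the canary numbers to mean anything.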

You can do a bunch of different things. Kelsey Hightower actually has got a fantastic video. I'm always going to mention Kelsey in any talk. Kelsey does amazing stuff, but he talked about how he was using, another play there. You can do request routing and shifting and shaping based on headers. So you can have beta users that are willing to put up with a slightly degraded service to get the new core features. Or maybe here you can say anyone who has an iPhone gets this experience. So you can do some quite funky things. With great power comes great responsibility, I might add.

So be careful what you do put in all these things, but don't forget, all these configs can be version controlled. Put them in Git, do PRs, chat about these things. Rather than just calling APIs and no one being really sure what state the system is in, you can be quite disciplined and set up continuous delivery pipelines to insist that your practices are being followed, but you can version control all this stuff, which I think is really nice.
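Header-based routing of the kind described above can be sketched as an ordered list of rules, which is also the sort of thing that is easy to keep in version control. The header names, values, and destination labels below are hypothetical.

```python
from typing import List, Mapping, Tuple

# (header name, expected value, destination); first match wins.
# All names here are made up for illustration.
RULES: List[Tuple[str, str, str]] = [
    ("x-beta-user", "true", "checkout-v2"),
    ("x-device", "iphone", "checkout-ios"),
]


def route(
    headers: Mapping[str, str],
    rules: List[Tuple[str, str, str]] = RULES,
    default: str = "checkout-v1",
) -> str:
    """Return the destination for a request based on its headers."""
    for name, expected, destination in rules:
        if headers.get(name) == expected:
            return destination
    return default
```

Because the first matching rule wins, rule order matters, which is one more reason to review changes to it like code.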

You can do per-request routing, that's kind of funky. So, a few clients, actually I'm working with Datawire looking at this kind of stuff. We can identify individual requests coming in on ingress. So maybe I am browsing production and I can specify a certain header. Now, obviously be careful with this stuff because someone else could probably copy my header, but one example we've used is where we're doing test traffic. We put a special header on and we actually have our new service deployed in production.

Based on that header, quite far down the stack, we can route to my test service. Now, you have to be careful that the test service is not doing something like mutating data, or if it is, that it is mutating data in something like a test database. You can either choose to have the traffic sort of interact with the system for real or you can shadow. You can just basically pump the traffic in, watch what happens in your new service, but the new service never responds. The original request continues down the stack, for example, but this is very fine-grained.
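Shadowing can be sketched in a few lines: a copy of the request goes to the new service, whatever it returns or raises is discarded, and the caller only ever sees the primary's response. The function and parameter names are assumptions; a real proxy would mirror asynchronously rather than inline as done here.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Handler = Callable[[Request], Any]


def handle_with_shadow(request: Request, primary: Handler, shadow: Handler) -> Any:
    """Serve from the primary while mirroring the request to a shadow service."""
    try:
        # Send a copy so the shadow cannot modify the real request.
        shadow(dict(request))
    except Exception:
        # Shadow failures must never affect the user-facing response.
        pass
    return primary(request)
```

The caveat from the talk still applies: the shadow service must not mutate real data, because from its point of view the request is genuine.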

You can basically do this per-request. If you're familiar with chaos engineering, Netflix have got a tool called failure injection testing, FIT, and it works from this principle. They specify users they want to test faults with, they effectively put a header on and identify that user at ingress, and then they mess with that request handling down the stack, but they use technology like a service mesh to identify who is interacting with each service.

You can do things like timeouts and deadlines, pretty standard stuff. Don't forget again, this is layers three and four, so your application is maybe not aware of these kinds of things. So, if a request does time out, make sure your service realizes it or, if you're not careful, you can get stuck in a retry storm where something times out and the service at the bottom doesn't realize it and keeps churning away. The user retries and then you get another request going down and you get this kind of crazy situation forming.

Deadlines are interesting. You can deduct the time going down, but I haven't really seen this used much in production yet. It's a nice idea that we can be somewhat intelligent at each service depending on the time we have left to handle a request, but it means we have to know about this stuff in layer seven. We have to have a library that interacts with the service mesh and can figure out how much time we've got left. I think it's quite an advanced pattern to be honest.
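The "deduct the time going down" idea can be sketched as a header that carries the remaining budget in milliseconds. Each hop subtracts the time it has already spent and refuses to call downstream once nothing is left. The header name and function signature are hypothetical, not a standard.

```python
from typing import Any, Callable, Dict

DEADLINE_HEADER = "x-request-deadline-ms"  # hypothetical header name


def forward_with_deadline(
    headers: Dict[str, str],
    elapsed_ms: int,
    send: Callable[[Dict[str, str]], Any],
) -> Any:
    """Forward a request downstream with the remaining time budget attached."""
    remaining = int(headers[DEADLINE_HEADER]) - elapsed_ms
    if remaining <= 0:
        # No point doing more work: the caller has already given up.
        raise TimeoutError("deadline exceeded before calling downstream")
    outgoing = dict(headers)
    outgoing[DEADLINE_HEADER] = str(remaining)
    return send(outgoing)
```

Failing fast like this is also what stops the bottom service from churning away on work nobody is waiting for.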

I mentioned circuit breaking. So, if a service is down or a service is overloaded, you can route around it or you can stop sending it traffic, but bear in mind this again is at the layer three, layer four level. Our application isn't particularly aware, in comparison with something like Hystrix. With Hystrix I can say, if this service is down, fall back to this service.

The example being if you had say a recommendation service that wasn't working. You can then fallback to generic recommendations. Then if that service wasn't working, you could just fallback to hey, show this page of new videos, kind of thing, but this is more at the network level, at almost the wire level. We would have to do some more intelligent operations in the application using something like a library, which has its trade-offs to make a more Hystrix-like decision.
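The application-level fallback chain in that example can be sketched as trying a list of sources in order and returning the first that works. This is a generic illustration of the pattern, not Hystrix's API; the function name and the example sources in the usage note are made up.

```python
from typing import Any, Callable, List


def first_available(*sources: Callable[[], Any]) -> Any:
    """Return the result of the first source that does not raise."""
    errors: List[Exception] = []
    for source in sources:
        try:
            return source()
        except Exception as exc:
            errors.append(exc)  # remember why this source failed, then move on
    raise RuntimeError(f"all {len(sources)} sources failed") from (
        errors[-1] if errors else None
    )
```

Usage would look like `first_available(personalised_recommendations, generic_recommendations, new_videos_page)`, where each argument is a zero-argument callable supplied by the application.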

I mentioned mutual TLS. This is kind of funky actually. So, people have used Let's Encrypt, and many of the cloud vendors have got their own way of using certificates now. So, it's much easier than when I started my career to issue TLS certificates. Now, what we can actually do is have something like a service mesh handle that all for us. It can rotate certificates. It can apply to Let's Encrypt and get new certificates, and we can guarantee over the network that Service A is talking to Service B, and Service B is talking to A.

Mutual TLS, not in the normal web like client server model when we don't have mutual TLS. We typically have one-way TLS. This is quite nice. Inside our infrastructure now, we can guarantee services are who they say they are. It's kind of nice.
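To show what separates mutual TLS from the usual one-way model, here is a sketch using Python's standard `ssl` module: the server side is configured to require and verify a client certificate. The function name and the file path parameters are placeholders; in a mesh the sidecar does this for you and also handles issuing and rotating the certificates.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that demands a client certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # One-way TLS leaves this at CERT_NONE; mutual TLS requires the client
    # to present a certificate that chains to a CA we trust.
    context.verify_mode = ssl.CERT_REQUIRED
    # Paths are optional only so the sketch can be built without real files;
    # a working server needs all three.
    if cert_file is not None:
        context.load_cert_chain(cert_file, key_file)  # this service's identity
    if ca_file is not None:
        context.load_verify_locations(cafile=ca_file)  # CA that signs clients
    return context
```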

Following on from that, we can do things like communication policies. Because service meshes can inspect the wire level traffic, not only can you do basic stuff on HTTP in terms of if a user is trying to access this endpoint, but you can also look in, say, the Mongo protocol, you can look in the Kafka protocol, and service meshes are moving more and more towards the space where they are the one true source of how we secure a system ...

Well, I said the one true source. Obviously, you have things like IAM on the actual assets themselves. You secure the container, you secure the queue itself, but in terms of user traffic, we could use the mesh to set various quotas, various metrics, let's say whitelists and things like that. And because the service mesh is inspecting all the traffic, it can then make decisions.

Calico have a really nice blog post if you're looking to learn more about this. This is something I do think is going to be pushed down into the infrastructure. We'll see this a lot more in Kubernetes and in ECS. I know Netflix, for example, are using something called OPA to define this, Open Policy Agent. I saw a great talk. I think it was at KubeCon in Austin last year. So have a Google for that, where a Netflix engineer talked about how they're doing a service mesh like thing internally. They're creating their own PaaS.

Netflix, don't forget, have a much bigger scale and much more specific requirements than many of us have, but they're creating their own PaaS. They're using OPA to define their interactions, not only when you're browsing for video, but even when you're watching, and a whole bunch of other stuff. So, for them, they've got many different interactions in EC2 and there's interactions with things deployed in [inaudible 00:34:35] they're using.

This is going to be their single policy for defining what user can do what and the relationships between these users. These schemas that say what people can do are pushed out to the service mesh. The service mesh enforces these kinds of policies at the edge of each service.
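
As a rough illustration of a policy being pushed out and enforced at the edge of each service, the sketch below uses a plain Python dictionary as the policy. This is deliberately not OPA or its Rego language; the `POLICY` table, the service names, and `is_allowed` are all hypothetical.

```python
# Hypothetical policy: (source service, destination service) -> allowed methods.
POLICY = {
    ("web", "recommendations"): {"GET"},
    ("web", "billing"): {"GET", "POST"},
}


def is_allowed(source, destination, method, policy=POLICY):
    """Default-deny check a sidecar could run before forwarding a request."""
    return method in policy.get((source, destination), set())
```

The point of the design is that the table is defined once, centrally, and every proxy evaluates the same rules locally, so no service has to embed its own authorization logic.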

A really cool thing that service meshes offer is visibility. I'm genuinely quite excited about that. There's the obvious stuff that, as a service mesh looks at every bit of traffic in your infrastructure, it can do your basics: what's the network traffic, how many errors am I seeing, what's going on in the system? That's using Prometheus. This is actually me spinning up Istio locally. I'm using their sample Bookinfo application. I've made some requests. I'm getting all the data from all communication going into Prometheus. I'm looking at Prometheus.

But the cool thing is you also get like a service graph because again, the service mesh is literally the fabric sitting amongst all the communications. You can figure out what's talking to what and in what percentage? Is it looking at a new version of the service, looking at old version? You can also do things like distributed tracing, which is seriously awesome.

I've tried to do stuff using Zipkin in the past in a Java stack and it was amazing, but when we tried to introduce other languages into the stack, it soon got quite challenging because we had to have various libraries in, say, Ruby, Python and Java all working together to preserve the correlation ID. So, as ingress came in, say NGINX, we marked it as this is user interaction 157. We needed to make all of our services pass the correlation ID down through the request handling so we could join these things up.

All distributed tracing really is, is distributed logging with a correlation ID that allows you to say, oh, request 157, it was handled by all these various services at various times. You can draw it in a graphic-like format. You can see it in the bottom left there. It's very powerful. We had a couple of problems where we couldn't figure out what service was causing latency in our stack, and when we put Zipkin in, or rather we actually already had Zipkin there and enabled it, suddenly it became painfully obvious which service was causing a problem. We then started debugging that service and it was really, really good.
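
The "distributed logging with a correlation ID" idea can be sketched in a few lines. This is an illustration only and not Zipkin's API or wire format: the header name `x-correlation-id`, the in-memory `log` list, and the helper names are all hypothetical.

```python
import uuid

# Hypothetical header name used to join log entries across services.
CORRELATION_HEADER = "x-correlation-id"


def ensure_correlation_id(headers):
    """At ingress: keep an existing ID, otherwise mint a new one."""
    out = dict(headers)
    out.setdefault(CORRELATION_HEADER, uuid.uuid4().hex)
    return out


def record_span(log, service, headers, start_ms, end_ms):
    """Each service logs its own work, tagged with the ID it was handed."""
    log.append({
        "id": headers[CORRELATION_HEADER],
        "service": service,
        "start_ms": start_ms,
        "end_ms": end_ms,
    })


def trace_for(log, correlation_id):
    """Join entries for one request: (service, duration_ms) in start order."""
    spans = sorted(
        (s for s in log if s["id"] == correlation_id),
        key=lambda s: s["start_ms"],
    )
    return [(s["service"], s["end_ms"] - s["start_ms"]) for s in spans]
```

The hard part in a polyglot stack is the middle step: every service, in every language, has to forward the header on its outgoing calls, which is exactly the work a sidecar proxy can help with.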

Right. Check my time. I've got quite a bit of material, but I think I'll skip some. In terms of implementations, popular ones, Envoy is a very popular data plane at the moment. It has come out of Lyft. It's battle-tested. It's really good stuff. Istio seems to be the control plane of choice, at least at the moment. One caveat I would mention is it is heavily stewarded by Google, if you look on the [CXC 00:37:25] groups, on CNCF. So take that as you will.

Google have very strong opinions, as do Amazon, as does Azure. So, it depends whether you buy into the Google model of IAM, for example, but Istio tends to be the control plane, and NGINX, as I mentioned, and Linkerd have got proofs of concept to work with Istio, or in the future something like Istio as well.

A couple of new ones popping up. There's Conduit, which is actually the second version of Linkerd. The original founders of Linkerd were from Twitter. They had something called Finagle in the Twitter stack and they sort of took out the best bits of Finagle, or the service mesh bits, and put it into Linkerd. But the challenge they've had with Linkerd is it's a JVM-based product. It's a Scala-based product. The JVM is a little bit heavy at runtime for this kind of model where you want something sitting next to every service.

So, they have done a completely new open source build called Conduit. Conduit is based on Rust or Go 2.0 effectively. Everyone is jumping onto Rust now. Rust is supercool as a language. I'm trying to learn a little bit. Awesome bit of kit, but they have rewritten their proxy in Rust. It is super alpha, as in they're definitely saying don't use it in production, but it's an interesting one to follow.

Cilium are very interesting as well. Cilium have got a slightly different model. They are using Envoy to enforce quite a few interactions, but they're also using something called eBPF. eBPF is a kernel-based technology. At the moment, a lot of the interactions are being done by a separate process, which is not as efficient as if we can push some of the service mesh functionality into the Linux kernel itself.

Think of eBPF as like Java getting compiled into byte code and running on a virtual machine. You can compile C and various languages into BPF and run it in the kernel itself. When you run stuff in the kernel, it is super fast compared to user space kind of stuff. But at the moment, Cilium are leveraging Envoy heavily. I think there's going to be a migration towards a bit of both, something going in the kernel and something going on externally as well. It is too early to tell who is going to be a winner to be honest, but it's a very interesting space to keep up to date with.

A couple of other ones. If you're using HashiCorp Nomad, check out Nelson. They're using Envoy again, but it's skewed towards HashiCorp Nomad. So, I love my HashiCorp tech. I don't use Nomad very much to be honest, but if you're using that it's interesting. eBay allegedly are migrating to Envoy and Istio I think, but at the moment, they have their own one called fabio, which is really skewed towards AWS. So, if you're looking for a service mesh on AWS, have a look at this. It integrates with Consul and ELB and the Amazon API gateways, things like that. Oops.

So, putting it all together, time check, this is Istio. Its vision is to be an open platform to connect, manage, and secure services, both service to service and also messaging. I mentioned before, proxies are the data plane, which is how this technology actually carries out its actions. Istio is the control plane operating on the proxies. The proxies are in theory swappable, but bear in mind that Istio was built directly with Envoy. So, this is Matt Klein's kind of opinion.

There's a lot of tight integration between Istio and Envoy that you don't see at the moment with NGINX and with Linkerd. Not to say in the future that's going to change, but at the moment Istio is very tightly bundled with Envoy. Just bear that in mind. This is what you pretty much get. I think we'll look at the control plane first.

So, imagine we've got our services here, A and B, communicating. They're communicating via sidecars. We have this control plane, and various different things in the control plane are responsible for different actions. So, if we're dialing in to the control plane, and I always get this wrong, but this is the Pilot and this is the Mixer.

The Pilot is responsible for driving, flying, the Envoy proxies. So, it takes the instructions we have. It does some abstraction over whatever platform we're running on. At the moment, it pretty much is just Kubernetes. There's a couple of other things coming in, but it provides the abstraction, and whatever we specify in terms of routing rules, it will then pass on to Envoy.

The Mixer is responsible for things like precondition checking, policy enforcement, quota management, and it also does the telemetry reporting as well. It abstracts away Prometheus and the various different stacks for recording data.

So the Istio team very much define this as using the Unix or the Linux model, the single responsibility principle. Do one thing, do one thing well. If you look back, the Pilot drives the Envoys. The Mixer is responsible for interactions in terms of telemetry and policy. The Auth section, the Auth module, is responsible for issuing certificates to secure the traffic. Nicely modularized system.

There's always a control plane. Linkerd has namerd. It's a command line control plane. NGINX, we just saw. Owen actually talked about this. NGINX is coming out with Controller, which is their kind of control plane. So, you always need a control plane. You always have a control plane. A lot of us are used to being the control plane. We would go and manually switch things and tweak things.

We would deploy a script into NGINX. We are effectively the control plane then, but as systems get more complex, we need better abstractions. More and more micro services, more and more functions as a service. We need tools like this to help us manage the scale and dynamic performance. So, watch this space is what I'm saying there.

Dialing back to the data plane, the proxy itself, Envoy is super popular in this space. A lot of people are building on the Envoy technology. So, I'm working with a team called [inaudible 00:43:48] in Boston and we're building an API gateway, or have built I should say, an API gateway on top of Envoy. So, if you want, say, your Envoy doing your service mesh east-west, you can have this doing your traffic north-south, your ingress traffic.

Gloo is super interesting. So, Idit from Solo, she's basically creating a framework where you can repackage Envoys and create custom routing levels. It's like a hybrid between API gateways and service meshes. It came out like three weeks ago. It's super new, but the idea is very smart, very tuned into the community. She's an expert in unikernels and a bunch of other radical tech.

So, these technologies I think hint where we're going to go with service meshes. We're going to push some of it into the infrastructure fabric. Some of it I think will be when you, in the future, have GKE or hosted Kubernetes, I think you'll get some service mesh functionality as part of that. I think some companies are going to pull up as well into API gateways and you're going to get this cleaner level in the middle where our applications either interact with the gateway or interact with the mesh.

If you want to get started, I've put a bunch of references there. Ben Hall runs a website called Katacoda. If you haven't heard of Katacoda, it's amazing. It's basically a browser-driven thing. You log on and you can have a terminal and you can play around with Istio. There's a whole bunch of tech there. Ben is an awesome guy.

So often if I'm trying to learn a new tech and Ben has created a Katacoda tutorial, I pop onto the website. I think you might have to register, I can't remember, but you can experiment all in the safety of a Docker container running on Ben's hardware. So, he's got some great Istio tutorials to play around with if you want to know more.

I'm pretty tight for time now, so I'll put the slides out, but just a couple of things that's probably worth mentioning in terms of if we're looking at use cases. So, what I would say is drawing a bit more on [inaudible 00:45:44] again, it's really hard often to debug micro services. There's many things. The interactions are more frequent. The interactions are out of process.

So, I no longer as a Java developer can, say, spin up my monolith and put breakpoints in, inspect variables, modify the data. It's really hard to do that in micro services because they might be multi-language and the communication is happening out of process.

So, Idit has written an open source project called Squash. I wrote up a talk of hers actually on InfoQ, and she's pitching service meshes as the ideal hook to do debugging. She's got a proof of concept that uses VS Code. Is anyone familiar with VS Code? A nice bit of kit from Microsoft. It's cross-language debugging. It's got GDB support. It's got Java support. It's got Python support.

It basically hooks into Envoy, and when it spots a request that's of interest, you can effectively set a breakpoint and, if that's a Java service, switch into the debugger. I play around with the code and when I'm done, I pass the request on to the next service. I think this is a really interesting bit of technology. It's very new, but service meshes are providing the hooks to understand the systems and to debug systems a little bit better.

I also think this is super interesting for migrations. So, Christian has written a bit more about that. If you are looking to pull out functionality from a monolith and create a new service from it, you can use technology like this to effectively test the functionality that you've pulled out: does it match the existing functionality?

You can do that by routing traffic to both bits of functionality, only returning the old data, but doing a diff basically between the monolith version of that service and our micro services version. There are some cool frameworks I've used in the past, like GitHub Scientist, which allows you to run experiments on different code paths, and Twitter's Diffy is very nice as well.
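
The "send to both, return the old, diff the results" pattern can be sketched like this. It is a simplified illustration in the spirit of Scientist and Diffy, not the API of either; `run_experiment` and its arguments are hypothetical names.

```python
def run_experiment(request, control, candidate, mismatches):
    """Call both implementations, record any difference, return the old result."""
    result = control(request)  # the monolith's answer is what the user gets
    try:
        observed = candidate(request)
        if observed != result:
            mismatches.append((request, result, observed))
    except Exception as exc:
        # A crash in the new service must never affect the user.
        mismatches.append((request, result, repr(exc)))
    return result
```

In a mesh the duplication would typically happen at the proxy by mirroring traffic, with the comparison done out of band, but the invariant is the same: the caller only ever sees the control result.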

Just finally, chaos engineering, I haven't heard many talks at this conference actually, but many conferences I've been to in the last six months. I'm very lucky to go to a lot of conferences, but many conferences are talking about this chaos engineering principle that's coming out of primarily Netflix, but many people are jumping on board now. The only way you can really test the resilience of a system at scale and the complexity of something like micro services is to run experiments in production. Break stuff and see what happens.

Istio is jumping on that bandwagon by allowing you to manipulate traffic. You can drop requests. You can return 404s, return 500s, and you can see either in staging or production, whatever your level of confidence is, what happens when you do mess with the requests. So, that's kind of cool.
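
As a rough idea of what this kind of fault injection does to traffic, the sketch below wraps a request handler and aborts a configurable percentage of calls. It is plain Python rather than Istio configuration; `with_faults`, `abort_percent`, and `abort_status` are hypothetical names.

```python
import random


def with_faults(handler, abort_percent, abort_status=500, rng=random.random):
    """Wrap a handler so that roughly abort_percent of requests fail early."""

    def wrapped(request):
        # rng() is uniform in [0, 1); scale it to a percentage.
        if rng() * 100 < abort_percent:
            return abort_status
        return handler(request)

    return wrapped
```

Wrapping a downstream client with, say, `with_faults(call_recommendations, abort_percent=10)` (a hypothetical client function) lets you check that the caller's timeouts and fallbacks actually behave as intended.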

Oops, hope I'd done. Sorry guys. I've lost one last. Where has it gone? Sorry. [inaudible 00:48:44]. Cool. So, I'll just get my feeds. So, I've put all the ones up on that touch time there. So wrapping up. Hopefully, this has been a little bit of a whistle stop tour. As the joke goes, you can kind of watch my talk back on half speed on YouTube, if I have spoken a bit too fast. So, I'm totally cool with that. I'll not take offense. As a native speaker, I do talk too fast sometimes.

But hopefully, you've seen that the service mesh holds a lot of promise for service to service communication. Even messaging in the future; I think Kafka in particular is becoming a target of service meshes. It's a way to homogenize all the communication we do, which is very attractive in the very dynamic, distributed and polyglot environment that many of us are working in with micro services now.

We're moving from these complicated worlds to complex worlds, and the technology, typically in regards to communication, hasn't always caught up. Kubernetes focused more on deployments than it did on runtime interaction, for example. Kubernetes is an amazing bit of kit, but they had to prioritize what they worked on. So, service meshes are the missing piece in my mind for the dynamic interactions, the dynamic behaviors in your infrastructure.

It's a great place to hook in for observability, to do testing, debugging, all these things. Debugging micro services is crazy, crazy hard and I think service meshes could make this easier. It's early days. A lot of the stuff coming out is open source, so we can all get involved, which I think is awesome.

Word of caution. This is proper, I like to say, hipster tech. I hope that translates okay over here. This is kind of hipster tech. I've worked with a few clients using Linkerd in production. I've not seen Istio in production, for example. I'm working with clients that are probably doing a POC, but it is very new tech. So, just take care with that kind of stuff. Read around. Look at it as a future thing, but always know the risks if you're going to use new technology.

I've worked over the last two years with many companies that have probably jumped onto the Kubernetes bandwagon a little bit too early. They've moved to Amazon without realizing the impact of using cloud. So, be careful with these kinds of things. Always use value stream mapping. Understand where your bottlenecks are. The theory of constraints is super interesting. If you're not having problems with communications within your services, a service mesh is probably not where you want to look.

If you're having problems with deployment, maybe look more at CI/CD technology, that kind of thing, and always justify it. I know we're mostly developers and engineers here, but think more in business terms like ROI, how much money am I going to get, things like total cost of ownership. If it's going to take me and my team a month to learn this new technology, that's a month's worth of work lost on something that could add business value. So, I'll just [inaudible 00:51:31] end the talk with that kind of caution on business stuff.

All right. This is my talk. One of [inaudible 00:51:36], of course, but I've got to really give a shout out to these people. William from Buoyant, Owen, Christian, Matt, Shriram, Louis, Varun, people that I jumped on a call with or chatted to on Skype, and they were super helpful in helping me to understand this technology. So, I'm really thankful for their time. This is very much appreciated. Well, on that note, thanks for listening. I appreciate it's lunch.

Speaker 2: There is lunch, but we have time for some questions.

Speaker 3: Thank you very much for the talk, awesome. What is the service mesh that you recommend, taking into consideration that I have, for example, 100 micro services? I want to have visibility about my SQL [inaudible 00:52:26], things like this.

Daniel Bryant: Yeah. So, I used to recommend Linkerd because Linkerd is the most battle-proven technology, but I'm a little bit concerned, especially on camera, with Conduit coming out. So, Buoyant have released Conduit. They're clearly saying they're going to support both technologies. I'll leave that there for you.

I'd actually say at the moment, some of the more traditional APM tooling like New Relic and a bunch of other ones like that, AppDynamics, are probably more suited to the observability in your case. They are definitely catching up with micro service ideas as well. It took them a while, so they're very keen to work with people that are leveraging micro services. My eye to the future would be something like Istio, but if I actually had a requirement now for observability and I had the money in my pocket, I probably would look more towards APM stuff.

If I'm a startup with no ... I can take more risks perhaps with startups, I might go with Istio. Some of my clients who are startups are like, "I'm going to play with Istio because we've got hardly any traffic so who cares if it breaks, and when we go live in a year's time we'll be in prime position." But if you are like an enterprise and you've got customers paying you money, it's quite a big gamble at the moment because, don't forget, every request goes through the service mesh.

So if a service mesh goes wrong ... Monzo, a U.K. challenger bank, talked about this. They had a massive outage with Linkerd and it was fast. They were super generous. So Monzo people are amazing. They talked about how they made it even worse, then rolled it back. So Monzo are very innovative, very hipster, and they got away with it. Whereas many of us in enterprises, we can't do that. So, hopefully that's a lot of words. Feel free to chat to me at lunch or whatever, but I hope that helps a little bit.

Speaker 3: Thank you very much.

Speaker 2: Are there any more questions? If ... Okay.

Speaker 4: Hi. Great talk. It seems that in the past we have talked about micro services that it takes away the barrier or it gives us a barrier that we have to talk to another service. Service meshes actually take away that barrier again. It makes it more easy to communicate to other services and it feels like that we can get again to that state where we just talk to other services like we used to talk to another function or another part within my code. How do you see that?

Daniel Bryant: That's a great question. So, one caveat I would always say is, and I've totally been burned by this several times, never treat local calls and remote calls the same. I have made that mistake in CORBA, in RMI. I've done that 70 times and I see this happening again, particularly with more junior developers.

So, I agree with you. I think definitely with things like the logical naming, for example, you can say user service and it maps to the user service in staging or production or whatever. I'm still a little bit, I'm forming my thoughts around this to be honest. When I work with clients, I always say, we as developers we need to have mechanical sympathy. It's a phrase I use quite a lot. I borrowed that from Martin Thompson in that you need to understand just enough one level down.

So if I'm a Java developer, I need to know a little bit about Kubernetes, a little bit about Istio. When I'm making a remote call, it might be super easy using something like Istio in comparison with something like my own custom HTTP framework or something, but there's always a trade-off. That's a remote call. There's an exponential increase in the likelihood of that call going wrong compared to an in-process call. Does that make sense?

So, I always work with teams and I say, hey, mechanical sympathy. Understand the eight fallacies of distributed computing. Always know when you're making a local or remote call and try not to create frameworks that hide that. Anyone who has worked on EJBs, that local and remote kind of thing? Horrible, yeah. So, nice idea.

I think the reality is that mechanical sympathy means you have to understand what you're doing. But I do think Istio adds a little bit of syntactic sugar effectively on top of these things that makes it a bit easier for us, which, as long as we're aware of that stuff, I think is cool.

Speaker 5: So, thanks, Daniel for the great talk. I have a question on the overhead. So, you mentioned already that we have an additional layer using service meshes. So, do you have any experience or is there any indication how much, how expensive this overhead actually is?

Daniel Bryant: Another great question. Thanks a lot. Yeah, it's a really good point in that there are more things involved in a typical request now because you go via the service mesh. Particularly with something like Istio, it also often looks up in the Mixer to check: I'm Service A making a request to Service B. Am I allowed to do that?

Obviously, you can heavily cache those policy lookups, but there's often extra hops there, and there's extra hops potentially between the mesh. So, my advice is always to benchmark your system and see if it impacts you. Adrian Cockcroft, I'm very lucky to know Adrian. A bit of name dropping there, but I've chatted to Adrian about this a few times and he is actually quite cautious about this kind of thing.

He's saying on a lot of systems, particularly, say, financial systems where we're talking about microseconds being super important, anything that adds on time is actually a bad thing. So, not only have you got the extra network call potentially involved there, but you've got something like Envoy doing a bit of processing as well.

So, I think Adrian has got a really good point and I respect Adrian a lot. I always listen to what he says. The kind of clients I have worked with are doing E-commerce systems and various other systems. For them, the benefit that something like a service mesh could provide in terms of understandability and observability I believe in general outweigh the costs you're going to pay with communication overhead, but it's something everyone has to decide.

The best way to do it, I think, is to sit down and have a think: what's more important to us, understandability or performance? For most people it's understandability. For banking and finance it's probably performance to be honest, but make your decision. Run some simple experiments and benchmark.

The Conduit people, for example, have benchmarked. I think their P99 overhead, it was crazy. It was at 20 microseconds or something. It was very small, but they're obviously running that benchmark in very favorable conditions; you can't blame them. So, the overhead should be quite minimal in theory, but my advice is always check what works for you.
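
A simple experiment of the kind suggested here is to time the same call with and without the mesh in the path and compare tail latency. The sketch below shows the measurement side only; `measure` and `percentile` are hypothetical helper names, and the percentile uses the nearest-rank method.

```python
import math
import time


def measure(call, runs=1000):
    """Time repeated invocations of call() and return durations in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return samples


def percentile(samples, pct):
    """Nearest-rank percentile, e.g. pct=99 for the P99 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Comparing `percentile(measure(direct_call), 99)` against `percentile(measure(call_via_mesh), 99)`, where both callables are your own hypothetical clients, gives a first estimate of the overhead under your conditions rather than the vendor's.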

Speaker 5: Thanks.

Daniel Bryant: It's a good question. It was definitely.

Speaker 2: So, is there one last question? There is.

Speaker 6: Thanks for the great talk. One question I have is how you would go about doing a step by step integration of the service mesh into your system, as in whether Istio works well with [non-part 00:59:22], let's say, which is not in the service mesh. Like you have old deployments of pods running in Kubernetes that don't have the sidecar. Does this work well together somehow?

Daniel Bryant: Yeah. That's a great question. [inaudible 00:59:34] when I've done the kind of pre-service mesh thing, so I was very lucky to work on micro services before it became micro services. I've worked on service meshes before they became service meshes. We often used it in migration. So, the SmartStack thing I mentioned, we have like a really monolith I think and some Java micro services. So, the communication [inaudible 00:59:55] more bits, only one part of the system used the service mesh. Yeah, it was fine because we're only interested in that bit.

But most of the people I see now looking at Istio, they're kind of going all in. The risk is obviously higher with doing that, but the trade-off is, if you're looking at something like distributed tracing, you cannot have any areas of your system that are dark, as they call it. Google have done an awesome paper called Dapper. Google have talked about how they created this Dapper framework, which is the parent of frameworks like Zipkin. The Dapper paper is actually very readable. It's a very good paper.

They talked about how, in order for distributed tracing for observability and debuggability to work in this kind of context, your whole system has to be instrumented. Google are Google, and they are very lucky in that they use something called Stubby, which is basically gRPC now, but they used Stubby and they used a bunch of other things that meant that they could put the instrumentation required in that library and enforce that everyone used it. But how often does that happen in the enterprise, let's be honest here? If I say you've got to use my library, people say, "Ha ha."

I think Google and Twitter, the unicorns as we call them, they have different cultures than many of us do. I'm not saying they're better or worse, they're just different. So, to wrap up, my answer is that probably the most benefit with modern service meshes comes from doing the whole thing, but just bear in mind the challenges you might have. If you've got ESBs and MQs and you've got like mainframes and all this stuff, it might not be possible to do it all in one.

Speaker 6: Thanks.

Speaker 2: So, thank you very much, Daniel. Lunch time is now and we see the next talk at quarter to two. Thank you very much.