Envoy Proxy 101: What it is, and why it matters

Envoy Proxy is a modern, high performance, small footprint edge and service proxy. Envoy is most comparable to software load balancers such as NGINX and HAProxy. Originally written and deployed at Lyft, Envoy now has a vibrant contributor base and is an official Cloud Native Computing Foundation project.

    Part 2: Deploying Envoy with a Python Flask webapp and Kubernetes

    In the first post in this series, Getting Started with Lyft Envoy for microservice resilience, we explored Envoy a bit, dug into a bit of how it works, and promised to actually deploy a real application using Kubernetes, Postgres, Flask, and Envoy. This time around we’ll make good on that promise.

    The Application

    What we’re going to do in this tutorial is to deploy a very, very simple REST-based user service: it can create users, read information about a user, and process simple logins. Obviously this isn’t terribly interesting by itself, but it brings several real-world concerns together:

    • It requires persistent storage, so we’ll have to tackle that early.
    • It will let us explore scaling the different pieces of the application.
    • It will let us explore Envoy at the edge, where the user’s client talks to our application, and
    • It will let us explore Envoy internally, brokering communications between the various parts of the application.

    Since Envoy is language-agnostic, we can use anything we like for the service itself. For this series, we’ll pick on Flask, both because it’s simple and because I like Python. On the database side, we’ll use PostgreSQL – it has good Python support, and it’s easy to get running both locally and in the cloud. And we’ll manage the whole thing with Kubernetes.

    Kubernetes

    Kubernetes is Datawire’s go-to container orchestrator these days, mostly because it does a fairly good job of letting you use the same tools whether you’re doing local development or deploying into the cloud for production. To get rolling today, we’ll need a Kubernetes cluster in which we’ll work. Within our cluster, we’ll create deployments that run the individual pieces of our application, and then expose services provided by those deployments (and when we do that, we get to decide whether to expose the service to the world outside the cluster, or only to other cluster members).

    We’ll start out using Minikube to create a simple Kubernetes cluster running locally. The existence of Minikube is one of the things I really like about Kubernetes – it gives me an environment that’s almost like running Kubernetes somewhere out in the cloud, but it’s entirely local and (with some care) it can keep working at 30,000 feet on an airplane with no WiFi.

    Note, though, that I said almost like running in the cloud. In principle, Kubernetes is Kubernetes and where you’re running doesn’t matter. In reality, of course, it does matter: networking, in particular, is something that ends up varying a bit depending on how you’re running your cluster. So getting running in Minikube is a great first step, but we’ll have to be aware that things will probably break a little bit as we move into the cloud.

    Setting Up

    Minikube

    Of course you’ll need Minikube installed; see https://github.com/kubernetes/minikube/releases for downloads and instructions. Mac users might also consider

    brew cask install minikube
                    

    Once Minikube is installed, you’ll need to start it. Mac users may want the xhyve driver to avoid needing to install VirtualBox:

    minikube start --vm-driver xhyve
                    

    Alternatively,

    minikube start
                    

    will fire things up with the default driver.

    Kubernetes

    To be able to work with Minikube, you’ll need the Kubernetes CLI, kubectl. Instructions are at https://kubernetes.io/docs/user-guide/prereqs/ — or, on a Mac, just use

    brew install kubernetes-cli
                    

    Docker

    You'll also need the Docker CLI, docker. Check out the Docker Community Edition if you're just getting started -- or, again, on a Mac use brew:

    brew install docker
                    

    The Application

    All the code and configuration we’ll use in this demo is in GitHub at

    https://github.com/datawire/envoy-steps

    Grab a clone of that, and cd into it. If you’re in the right place, you’ll see a README.md and directories named postgres, usersvc, etc. Each of the directories is for a Kubernetes deployment, and each can be brought up or down independently with

    bash up.sh $service
                    

    or

    bash down.sh $service
                    

    Obviously I’d prefer to simply include everything you need in this blog post, but between Python code, all the Kubernetes config, docs, etc, there’s just too much. So we’ll hit the highlights here, and you can look at the details to your heart’s content in your clone of the repo.

    The Docker Registry

    Minikube starts a Docker daemon when it starts up, and we'll need to use that daemon for our Docker image builds so that the Minikube containers can load our images. To set up your Docker command-line tools for that:

    eval $(minikube docker-env)
                    

    One of the benefits of Minikube is that a container can always pull your images from the local Docker daemon started by Minikube, so we don't need to push Docker images to any registry -- just building them using the Minikube Docker daemon is good enough. To tell the repo's scripts that we're using Minikube and that no registry is needed, run

    bash prep.sh -
                    

    If you want to reset to a pristine condition later, you can use

    bash clean.sh
                    

    Database Matters

    Our database can be really simple — we just need a single table to store our user information. We can start by writing the Flask app to check at boot time and create our table if it doesn’t exist, relying on Postgres itself to make sure that only one table ever exists. (Later, as we look into multiple Postgres servers, we may need to change this — but let’s keep it simple for now.)
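
    As a rough sketch of the boot-time check described above, the app can issue a CREATE TABLE IF NOT EXISTS statement once at startup. The table name, the column names, and the ensure_users_table helper are assumptions for illustration (the real schema lives in the repo). The demo runs against an in-memory SQLite database purely so that it works without a server; with Postgres you would pass in a psycopg2 connection instead.

```python
import sqlite3

# Hypothetical schema: these column names are illustrative, not taken from the repo.
CREATE_USERS_TABLE = """
CREATE TABLE IF NOT EXISTS users (
    uuid     VARCHAR(64)   NOT NULL PRIMARY KEY,
    username VARCHAR(64)   NOT NULL UNIQUE,
    fullname VARCHAR(2048) NOT NULL,
    password VARCHAR(256)  NOT NULL
)
"""


def ensure_users_table(conn):
    """Create the users table if it is missing; safe to call on every boot."""
    cursor = conn.cursor()
    cursor.execute(CREATE_USERS_TABLE)
    conn.commit()
    cursor.close()


if __name__ == "__main__":
    # Stand-in database so the sketch runs anywhere (assumption: not Postgres).
    conn = sqlite3.connect(":memory:")
    ensure_users_table(conn)
    ensure_users_table(conn)  # the second call changes nothing
    print("users table ready")
```

    Because the statement is idempotent, every replica of the Flask app can run it at startup without coordinating with the others.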

    So the only thing we really need is a way to spin up a Postgres server in our Kubernetes cluster. Fortunately there’s a published Postgres 9.6 Docker image readily accessible, so creating the Postgres deployment is pretty easy. The relevant config file is postgres/deployment.yaml, which includes in its spec section the specifics of the image we’ll use:

    spec:
      containers:
      - name: postgres
        image: postgres:9.6

    Given the deployment, we also need to expose the Postgres service within our cluster. That’s defined in postgres/service.yaml with highlights:

    spec:
      type: ClusterIP
      ports:
      - name: postgres
        port: 5432
      selector:
        service: postgres

    Note that we mark this with type ClusterIP, so that it can be seen only within the cluster.

    To fire this up, just run

    bash up.sh postgres
                    

    Once that’s done, kubectl get pods should show the postgres pod running:

    NAME                       READY  STATUS   RESTARTS AGE
    postgres-1385931004-p3szz  1/1    Running  0        5s

    and kubectl get services should show its service:

    NAME      CLUSTER-IP     EXTERNAL-IP  PORT(S)   AGE
    postgres  10.107.246.55  <none>       5432/TCP  5s

    So we now have a running Postgres server, reachable from anywhere in the cluster at postgres:5432.

    The Flask App

    Our Flask app is really simple: basically it just responds to PUT requests to create users, and GET requests to read users and respond to health checks. You can see it in full in the GitHub repo.

    The only real gotcha is that by default, Flask will listen only on the loopback address, which will prevent any connections from outside the Flask app’s container. We set the Flask app to explicitly listen on 0.0.0.0 instead, so that we can actually speak to it from elsewhere (whether from in the cluster or outside).
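
    To see the difference between the two bind addresses without involving Flask at all, here is a small sketch using Python's standard socket module. The open_listener helper is hypothetical. With Flask itself, the equivalent switch is the host argument to app.run, for example app.run(host="0.0.0.0", port=5000).

```python
import socket


def open_listener(host, port=0):
    """Bind a TCP listening socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(1)
    return sock


if __name__ == "__main__":
    # Loopback only: reachable from inside the same container, nowhere else.
    loopback = open_listener("127.0.0.1")
    # All interfaces: reachable from other pods and from outside the cluster.
    wildcard = open_listener("0.0.0.0")
    print("loopback listener:", loopback.getsockname())
    print("wildcard listener:", wildcard.getsockname())
    loopback.close()
    wildcard.close()
```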

    To get the app running in Kubernetes, we’ll need a Docker image that contains our app. We’ll build this on top of the lyft/envoy image, since we already know we’re headed for Envoy later — thus our Dockerfile (sans comments) ends up looking like this:

    FROM lyft/envoy:latest
    RUN apt-get update && apt-get -q install -y \
        curl \
        python-pip \
        dnsutils
    WORKDIR /application
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    COPY service.py .
    COPY entrypoint.sh .
    RUN chmod +x entrypoint.sh
    ENTRYPOINT [ "./entrypoint.sh" ]

    We’ll build that into a Docker image, then fire up a Kubernetes deployment and service with it. The deployment, in usersvc/deployment.yaml, looks basically the same as the one for postgres, just with a different image name:

    spec:
      containers:
      - name: usersvc
        image: usersvc:step1

    Likewise, usersvc/service.yaml is much like its postgres sibling, but we’re using type LoadBalancer to indicate that we want the service exposed to users outside the cluster:

    spec:
      type: LoadBalancer
      ports:
      - name: usersvc
        port: 5000
        targetPort: 5000
      selector:
        service: usersvc

    It may seem odd to be starting with LoadBalancer here — after all, we want to use Envoy to do load balancing, right? The point is walking before running: our first test will be to talk to our service without Envoy, and for that we need to expose the port to the outside world.

    To build the Docker image and crank up the service, run

    bash up.sh usersvc
                    

    At this point, kubectl get pods should show both the usersvc pod and the postgres pod running:

    NAME                       READY  STATUS   RESTARTS AGE
    postgres-1385931004-p3szz  1/1    Running  0        5m
    usersvc-1941676296-kmglv   1/1    Running  0        5s

    First Test!

    And now for the moment of truth: let’s see if it works without Envoy before moving on! This will require us to get the IP address and mapped port number for the usersvc service. Since we’re using Minikube, we use

    minikube service --url usersvc
                    

    to get a neatly-formed URL to our usersvc. (Obviously, this will change when we move beyond Minikube.)

    Let’s start with a basic health check using curl from the host system, reaching into the cluster to the usersvc, which in turn is talking within the cluster to postgres:

    curl $(minikube service --url usersvc)/user/health
                    

    If all goes well, the health check should return something like

    {
      "hostname": "usersvc-1941676296-kmglv",
      "msg": "user health check OK",
      "ok": true,
      "resolvedname": "172.17.0.10"
    }

    Next up we can try saving and retrieving a user:

    curl -X PUT -H "Content-Type: application/json" \
         -d '{ "fullname": "Alice", "password": "alicerules" }' \
         $(minikube service --url usersvc)/user/alice

    This should give us a user record for Alice, including her UUID but not her password:

    {
      "fullname": "Alice",
      "hostname": "usersvc-1941676296-kmglv",
      "ok": true,
      "resolvedname": "172.17.0.10",
      "uuid": "44FD5687B15B4AF78753E33E6A2B033B"
    }

    If we repeat it for Bob, we should get much the same:

    curl -X PUT -H "Content-Type: application/json" \
         -d '{ "fullname": "Bob", "password": "bobrules" }' \
         $(minikube service --url usersvc)/user/bob

    Note, of course, that Bob should have a different UUID:

    {
      "fullname": "Bob",
      "hostname": "usersvc-1941676296-kmglv",
      "ok": true,
      "resolvedname": "172.17.0.10",
      "uuid": "72C77A08942D4EADA61B6A0713C1624F"
    }
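
    The two properties we just checked (a fresh UUID per user, and no password in the reply) can be sketched in a few lines of Python. This is an illustration rather than the repo's code: new_user_row and public_view are hypothetical names, and hashing the password with PBKDF2 is an assumption about how a service like this might store it.

```python
import hashlib
import os
import uuid


def new_user_row(username, fullname, password):
    """Build the row to store: a fresh UUID plus a salted password hash."""
    salt = os.urandom(16)
    # Assumption: PBKDF2-SHA256; the real service may store passwords differently.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return {
        "uuid": uuid.uuid4().hex.upper(),  # 32 uppercase hex digits
        "username": username,
        "fullname": fullname,
        "password": salt.hex() + ":" + digest.hex(),
    }


def public_view(row):
    """Return only the fields that are safe to send back to the client."""
    return {"uuid": row["uuid"], "fullname": row["fullname"], "ok": True}


if __name__ == "__main__":
    alice = new_user_row("alice", "Alice", "alicerules")
    bob = new_user_row("bob", "Bob", "bobrules")
    print(public_view(alice))
    print(public_view(bob))
```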
                    

    Finally, we should be able to read both users back (again, minus passwords!) with

    curl $(minikube service --url usersvc)/user/alice
    curl $(minikube service --url usersvc)/user/bob

    Enter Envoy

    Given that all of that is working (whew!)… it’s time to stick Envoy in front of everything, so it can manage routing when we start scaling the front end. As we discussed in the previous article, this means that we have an edge Envoy and an application Envoy, each of which needs its own configuration. We’ll crank up the edge Envoy first.

    Since the edge Envoy runs in its own container, we’ll need a separate Docker image for it. Here’s the Dockerfile:

    FROM lyft/envoy:latest
    RUN apt-get update && apt-get -q install -y \
        curl \
        dnsutils
    COPY envoy.json /etc/envoy.json
    CMD /usr/local/bin/envoy -c /etc/envoy.json

    which is to say, we take lyft/envoy:latest, copy in our own Envoy config, and start Envoy running.

    Our edge Envoy’s config is fairly simple, too, since it only needs to proxy any URL starting with /user to our usersvc. Here’s how you set up virtual_hosts for that:

    "virtual_hosts": [
      {
        "name": "service",
        "domains": [ "*" ],
        "routes": [
          {
            "timeout_ms": 0,
            "prefix": "/user",
            "cluster": "usersvc"
          }
        ]
      }
    ]

    and here’s the related clusters section:

    "clusters": [
      {
        "name": "usersvc",
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://usersvc:80"
          }
        ]
      }
    ]

    Note that we’re using strict_dns, which means that we’re relying on every instance of the usersvc appearing in the DNS. We’ll find out if this actually works shortly!
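
    To make the routing half of that config concrete, here is a minimal sketch of prefix matching as the virtual_hosts section above uses it: routes are tried in order, and the first one whose prefix matches the request path picks the cluster. The pick_cluster function is a hypothetical stand-in for what Envoy does internally, not Envoy code.

```python
# Mirrors the single route in the edge Envoy's virtual_hosts section.
ROUTES = [
    {"prefix": "/user", "cluster": "usersvc"},
]


def pick_cluster(path, routes):
    """Return the cluster of the first route whose prefix matches, else None."""
    for route in routes:
        if path.startswith(route["prefix"]):
            return route["cluster"]
    return None


if __name__ == "__main__":
    for path in ("/user/alice", "/user/health", "/other"):
        print(path, "->", pick_cluster(path, ROUTES))
```

    Note that this is a plain string-prefix test, so a path such as /username would match the /user route as well.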

    As usual, you can get the edge Envoy running with a single command:

    bash up.sh edge-envoy
                    

    Sadly we can’t really test anything yet, since the edge Envoy is going to try to talk to application Envoys that aren’t running yet.

    App Changes for Envoy

    Once the edge Envoy is running, we need to switch our Flask app to use an application Envoy. We needn’t change the database at all, but the Flask app needs a few tweaks:

    • We need to have the Dockerfile copy in an Envoy config file.
    • We need to have the entrypoint.sh script start Envoy as well as the Flask app.
    • While we’re at it, we can switch back to having Flask listen only on the loopback interface, and
    • We’ll switch the service from a LoadBalancer to a ClusterIP.

    The effect here is that we’ll have a running Envoy through which we can talk to the Flask app — but also that Envoy will be the only way to talk to the Flask app. Trying to go direct will be blocked in the network layer.

    The application Envoy’s config, while we’re at it, is very similar to the edge Envoy’s. The listeners section is actually identical, and the clusters section nearly so:

    "clusters": [
      {
        "name": "usersvc",
        "type": "static",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:80"
          }
        ]
      }
    ]

    Basically we just use a static single-member cluster, with only localhost listed.

    All the changes to the Flask side of the world can be found in the usersvc2 directory, which is literally a copy of the usersvc directory with the changes we discussed above applied (and it tags its image usersvc:step2 instead of usersvc:step1). We need to drop the old usersvc:

    bash down.sh usersvc
                    

    and then bring up the new one:

    bash up.sh usersvc2
                    

    Second Test!

    Once all that is done, voilà: you should be able to retrieve Alice and Bob from before:

    curl $(minikube service --url edge-envoy)/user/alice
                    curl $(minikube service --url edge-envoy)/user/bob
                    

    …but note that we’re using the edge-envoy service here, not the usersvc, which means that we are indeed talking through the Envoy mesh! In fact, if you try talking directly to usersvc, it will fail: that’s part of how we can be sure that Envoy is doing its job.

    Scaling the Flask App

    One of the promises of Envoy is helping with scaling applications. Let’s see how well it handles that by scaling up to multiple instances of our Flask app:

    kubectl scale --replicas=3 deployment/usersvc
                    

    Once that’s done, kubectl get pods should show more usersvc instances running:

    NAME                         READY STATUS   RESTARTS  AGE
    edge-envoy-2874730579-7vrp4  1/1   Running  0         3m
    postgres-1385931004-p3szz    1/1   Running  0         5m
    usersvc-2016583945-h7hqz     1/1   Running  0         6s
    usersvc-2016583945-hxvrn     1/1   Running  0         6s
    usersvc-2016583945-pzq2x     1/1   Running  0         3m

    and we should then be able to see curl getting routed to multiple hosts. Try running

    curl $(minikube service --url edge-envoy)/user/health
                    

    multiple times, and look at the hostname element. It should be cycling across our three usersvc nodes.

    But it’s not. Uh-oh. What’s going on here?

    Remembering that we’re running Envoy in strict_dns mode, a good first check would be to look at the DNS. We can do this by running nslookup from inside the cluster. Specifically, we can use a usersvc pod:

    kubectl exec usersvc-2016583945-h7hqz -- /usr/bin/nslookup usersvc
                    

    (Make sure to use one of your pod names when you run this! Just pasting the line above is extremely unlikely to work.)

    Running this check, we find that only one address comes back — so Envoy’s DNS-based service discovery simply isn’t going to work. Envoy can’t round-robin among our three service instances if it never hears about two of them.
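
    Another way to see what a strict_dns cluster has to work with is to list every address a name resolves to, since Envoy in this mode treats each returned address as one cluster member. The sketch below uses Python's socket module and is an illustration, not part of the repo. The resolve_all helper is hypothetical, and because usersvc only resolves from inside the cluster, the demo uses localhost as a stand-in.

```python
import socket


def resolve_all(hostname, port=80):
    """Return the sorted set of IPv4 addresses that DNS reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Inside a pod you would call resolve_all("usersvc") instead.
    addresses = resolve_all("localhost")
    print(len(addresses), "address(es):", addresses)
```

    Run from inside a pod against usersvc, a single address in the result means a single cluster member as far as strict_dns is concerned, however many replicas are running.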

    The Service Discovery Service

    What’s going on here is that Kubernetes puts each service into its DNS, but it doesn’t put each service endpoint into its DNS — and we need Envoy to know about the endpoints in order to load-balance. Thankfully, Kubernetes does know the service endpoints for each service, and Envoy knows how to query a REST service for discovery information. We can make this work with a simple Python shim that bridges from the Envoy “Service Discovery Service” (SDS) to the Kubernetes API.

    (There’s also the Istio project, which is digging into a more full-featured solution here. Istio is still in its very early stages, though, so we’re going to stick with the simple way here.)

    Our SDS is in the usersvc-sds directory. It’s pretty straightforward: when Envoy asks it for service information, it uses the requests Python module to query the Kubernetes endpoints API, and reformats the results for Envoy. The most bizarre bit might be the token it reads at the start: Kubernetes is polite enough to install an authentication token on every container it starts, precisely so that this sort of thing is possible.
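
    The reformatting step is the heart of the shim, so here is a minimal sketch of it, assuming Envoy's v1 SDS reply shape (a hosts list whose entries carry ip_address, port, and tags) and the standard Kubernetes Endpoints object (subsets containing addresses and ports). The function name endpoints_to_sds is hypothetical, and the real code in usersvc-sds may be organized differently.

```python
import json

# Where Kubernetes mounts the service-account token inside each container.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def endpoints_to_sds(endpoints):
    """Flatten a Kubernetes Endpoints object into an Envoy v1 SDS reply."""
    hosts = []
    # "subsets" can be missing or null when no pods are ready.
    for subset in endpoints.get("subsets") or []:
        ports = [entry["port"] for entry in subset.get("ports") or []]
        for address in subset.get("addresses") or []:
            for port in ports:
                hosts.append({"ip_address": address["ip"], "port": port, "tags": {}})
    return {"hosts": hosts}


if __name__ == "__main__":
    # Hand-written sample shaped like a Kubernetes Endpoints object.
    sample = {
        "subsets": [
            {
                "addresses": [{"ip": "172.17.0.10"}, {"ip": "172.17.0.11"}],
                "ports": [{"port": 80}],
            }
        ]
    }
    print(json.dumps(endpoints_to_sds(sample), indent=2))
```

    In the running shim, the Endpoints object would come from a GET to /api/v1/namespaces/{namespace}/endpoints/usersvc on the Kubernetes API server, sent with the token from TOKEN_PATH in an Authorization: Bearer header.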

    We also need to modify the edge Envoy’s config slightly: rather than using strict_dns mode, we need sds mode. In turn, that means we have to define an sds cluster (which uses DNS to locate its server at the moment — we may have to tackle that later, too, as we scale the SDS out!):

    "cluster_manager": {
      "sds": {
        "cluster": {
          "name": "usersvc-sds",
          "connect_timeout_ms": 250,
          "type": "strict_dns",
          "lb_type": "round_robin",
          "hosts": [
            {
              "url": "tcp://usersvc-sds:5000"
            }
          ]
        },
        "refresh_delay_ms": 15000
      },
      "clusters": [
        {
          "name": "usersvc",
          "connect_timeout_ms": 250,
          "type": "sds",
          "service_name": "usersvc",
          "lb_type": "round_robin",
          "features": "http2"
        }
      ]
    }

    Look carefully: the sds cluster is not defined inside the clusters dictionary, but as a peer of clusters. Its value is a cluster definition, though. Once the sds cluster is defined, you can simply say "type": "sds" in a service cluster definition, and delete any hosts array for that cluster.
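
    Since the placement of the sds block is easy to get wrong, a small sanity check can help before handing the file to Envoy. The sketch below is a hypothetical helper, not part of Envoy or the repo. It only checks the three structural points just described: a top-level sds block exists, sds-type clusters carry a service_name, and they do not list hosts.

```python
def check_cluster_manager(cluster_manager):
    """Return a list of human-readable problems; an empty list means it looks OK."""
    problems = []
    sds_defined = "sds" in cluster_manager
    for cluster in cluster_manager.get("clusters", []):
        if cluster.get("type") != "sds":
            continue  # only sds-type clusters are checked here
        name = cluster.get("name", "(unnamed)")
        if not sds_defined:
            problems.append(name + ": type is sds but no sds block sits beside clusters")
        if "hosts" in cluster:
            problems.append(name + ": sds clusters should not list hosts")
        if "service_name" not in cluster:
            problems.append(name + ": sds clusters need a service_name")
    return problems


if __name__ == "__main__":
    config = {
        "sds": {"cluster": {"name": "usersvc-sds"}, "refresh_delay_ms": 15000},
        "clusters": [
            {"name": "usersvc", "type": "sds", "service_name": "usersvc"},
        ],
    }
    print(check_cluster_manager(config) or "looks OK")
```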

    The edge-envoy2 directory has everything set up for an edge Envoy running this config. So let’s crank up the SDS, then down the old edge Envoy and fire up the new:

    bash up.sh usersvc-sds
    bash down.sh edge-envoy
    bash up.sh edge-envoy2

    and now, repeating our health check really should show you round-robining around the hosts. But, of course, asking for the details of user alice should always give the same results, no matter which host does the database lookup:

    curl $(minikube service --url edge-envoy)/user/alice
                    

    If you repeat that a few times, the host information should change, but the user information should not.
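
    Rather than eyeballing repeated curl output, the round-robin check can be scripted. This is a sketch under a few assumptions: the helper names are hypothetical, the base URL is whatever minikube service --url edge-envoy prints, and the replies carry a hostname field as in the health check output earlier.

```python
import json
import sys
import urllib.request
from collections import Counter


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def tally_hostnames(replies):
    """Count how many replies came from each hostname."""
    return Counter(reply["hostname"] for reply in replies)


def check_round_robin(base_url, attempts=9):
    """Hit the health check several times and tally which pods answered."""
    replies = [fetch_json(base_url + "/user/health") for _ in range(attempts)]
    return tally_hostnames(replies)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the edge Envoy's base URL as the first argument.
    for hostname, count in sorted(check_round_robin(sys.argv[1]).items()):
        print(hostname, count)
```

    With three replicas and working service discovery, the tally should list three hostnames with roughly equal counts.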

    Up Next

    We have everything working, including using Envoy to handle round-robining traffic between our several Flask apps. With kubectl scale, we can easily change the number of instances of Flask apps we’re running. And, as you probably noticed, bringing Envoy in once our app was running wasn’t hard. It’s a pretty promising way to add a lot of flexibility without a lot of pain.

    Next up: Google Container Engine and AWS. One of the promises of Kubernetes is being able to easily get stuff deployed in multiple environments, so we’re going to see whether that actually works.