Microservices, Observability, and Groundhog Day from the Microservices.com Practitioner Summit

This is a guest post by Kevin Marks summarizing the recent Microservices.com Practitioner Summit, hosted by Datawire in San Francisco on January 31, 2017.

Susan Fowler and I share a favorite movie: Groundhog Day. There’s something magical about seeing Bill Murray’s transformation from a surly misanthrope to a pillar of the community that lifts the heart every February 2nd.

I got that feeling two days early this year at the Microservices.com Practitioner Summit, where the transformative power of observability was a common thread running through the talks.

Matt Klein of Lyft said:

“When I joined Lyft people were actually afraid of making service calls as they couldn’t know what went wrong. You have limited visibility into different vendors’ logging and tracing models, so there is little trust.”

He said that SOA has historically lacked the ability to debug across services, so productivity was low, and because of the lack of trust in networked calls people rebuilt monoliths with fixed libraries instead. By building Envoy to route all traffic between services, Lyft can sample and trace entire request chains.

“Because we have a stable requestID, we can trace and log across multiple systems and servers, so you can have a dashboard that shows all connections between any 2 services, and look at any 2 hops in the system and how they relate. Being able to reason about where time is spent really matters.”
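
The mechanics behind that stable requestID are simple enough to sketch. Below is a minimal, hypothetical Go HTTP middleware (not Lyft’s actual Envoy setup) that reuses an incoming x-request-id header if a proxy has already set one, or generates a new one, so that log lines from every hop in a call chain can be joined on the same ID. The handler path and port are illustrative.

```go
package main

import (
	"log"
	"net/http"

	"github.com/google/uuid" // assumed dependency, used only to mint IDs
)

// requestID reuses an incoming x-request-id header (as a fronting proxy
// such as Envoy would set) or generates one, so every log line and
// downstream call in the chain can be joined on the same ID.
func requestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("x-request-id")
		if id == "" {
			id = uuid.NewString()
		}
		w.Header().Set("x-request-id", id)
		log.Printf("request_id=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/rides", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", requestID(mux)))
}
```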

Varun Talwar’s work on gRPC at Google has similar goals – he described how attaching metadata for authorization, trace contexts and even client type to the RPC calls enables deep observability and introspection.

“You can go to any service endpoint and see a browser dashboard in real time on how much traffic is flowing. You often have 1 query out of 10,000 that is slow – you want to trace it through the whole call chain. You also want to look at aggregate info to see where the hotspots are.”
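
Varun didn’t show code, but attaching that kind of metadata to a gRPC call looks roughly like the following Go sketch. The header names, token, and target address are illustrative placeholders, and the standard health-check service stands in for a real RPC.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/metadata"
)

func main() {
	// Placeholder target address.
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Attach authorization, trace context, and client type as gRPC metadata
	// so the server (and anything in between) can log and aggregate by them.
	ctx = metadata.AppendToOutgoingContext(ctx,
		"authorization", "Bearer <token>",
		"x-trace-id", "abc123",
		"x-client-type", "mobile",
	)

	// The health-check service is used here as a stand-in for a real RPC.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("status:", resp.GetStatus())
}
```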

This ability to retrace the entire path of causation reminds me of the “I am a God” scene from Groundhog Day, where Phil tells Rita about every person in the room and their history and says “Maybe God isn’t omnipotent. Maybe he’s just been around so long, he knows everything.”

Josh Holtzman built this kind of omniscient tracking into Xoom.com when they were introducing microservices:

“We built a time series for every endpoint and call. We were very worried about performance when we started on this journey – we were worried about extra net traffic. So we spent a lot of time instrumenting our code before we made any changes, and I recommend that. We improved the throughput of our service dramatically, primarily because of the shift to accurate monitoring. The key is to measure everything, and be prepared to scale monitoring to cope.”
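
As a rough illustration of what “a time series for every endpoint and call” means in practice, here is a minimal Go sketch that times each request to a named endpoint. It only logs the duration; a real setup like the one Josh describes would push these samples into a time-series store.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// timed wraps a handler and records how long each call to the named endpoint
// takes. Logging stands in for exporting the sample to a time-series system,
// so that every endpoint and call gets its own series.
func timed(endpoint string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		log.Printf("endpoint=%s duration_ms=%d", endpoint, time.Since(start).Milliseconds())
	}
}

func main() {
	http.HandleFunc("/transfer", timed("transfer", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```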

Once you have this level of understanding of cause and effect, what do you do? You work to improve things. As Rafi Schloming of datawire.io puts it:

“Development is frequent small changes with quick feedback and measurable impact at each step, so microservices are a developmental methodology for systems, rather than an architectural one. Small frequent changes and rapid feedback and visibility are a given for a codebase, but harder for a whole system. Microservices are a way to gather rapid feedback – not just tests but live measurement.”

Rafi would build, test, assess impact, and deploy fixes incrementally, improving the system while it was running and intervening to head off failure, making this routine.

Think of the last act of Groundhog Day, where Phil is able to save people from flat tires, choking on food, and falling out of trees as a matter of routine.

Addressing these gaps is why Susan Fowler built standardized infrastructure at Uber:

“People will move between teams and leave old services running – no-one wants to clean up the old stuff. There are complex dependency chains that you can’t know are reliable. You don’t know that your dependencies will work, or that your clients won’t overload you.”

What is needed is fault tolerance and catastrophe preparedness: the ability to withstand both internal and external failure modes. Continually probing the system and seeing the consequences can be done when you have the confidence that you can bring things back online, like Phil risking death but awaking again every day at 6am with the knowledge from the previous day’s failures and successes. Once you have monitoring and documentation standards so you know the state of the system, and very good logging to see bugs, you can do as Susan says:

“Try every failure mode you can think of – push it to that mode in production and see how it does fail in practice.”
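
In practice that kind of deliberate failure testing often starts with a small amount of fault-injection code or proxy configuration. The sketch below is a hypothetical Go middleware (not Uber’s tooling) that fails a configurable fraction of requests outright and slows another fraction down, so you can watch your dashboards and see how the system actually degrades.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"
)

// faults injects failures into a configurable fraction of requests: some get
// a 503 response, some get extra latency. Pushing a service into a failure
// mode on purpose shows how it really fails rather than how you hope it fails.
func faults(errRate, slowRate float64, delay time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch {
		case rand.Float64() < errRate:
			http.Error(w, "injected failure", http.StatusServiceUnavailable)
			return
		case rand.Float64() < slowRate:
			time.Sleep(delay)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/quote", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// Illustrative rates: fail 1% of requests and slow another 5% by 2 seconds.
	log.Fatal(http.ListenAndServe(":8080", faults(0.01, 0.05, 2*time.Second, mux)))
}
```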

The key to success and productive use of microservices is having this global visibility into your system, so that you don’t spend six months learning how to throw cards into a hat more accurately, but instead target the changes and improvements that really make a difference to everyone else’s experience of the system.