Deadlines, Distributed Timeouts, and Microservices

The timeout pattern prevents remote procedure calls (RPCs) from waiting indefinitely for a response. A timeout specifies how long the RPC has to return. If no response arrives before the timeout expires, the caller invokes a fallback mechanism, such as retrying the request or throwing an exception. This is perhaps the most basic resilience pattern used in RPCs.
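As a minimal sketch of the pattern, the Python snippet below runs a call in a worker thread and falls back if no result arrives in time. The names `call_with_timeout`, `slow_rpc`, and the fallback callable are hypothetical and chosen for illustration; a real RPC library would normally expose its own timeout option.

```python
import concurrent.futures
import time


def call_with_timeout(rpc, timeout_s, fallback):
    """Run rpc() in a worker thread; return fallback() if it is too slow.

    Illustrative helper (an assumption, not a library API). The timeout only
    stops the caller from waiting; the worker thread itself is not cancelled.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(rpc)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # No response by the timeout: invoke the fallback mechanism.
        return fallback()
    finally:
        # Do not block the caller while the slow call finishes.
        pool.shutdown(wait=False)


def slow_rpc():
    """Stand-in for a remote call that takes 0.3 seconds to answer."""
    time.sleep(0.3)
    return "late response"
```

Here the fallback is just another callable, so it could retry the request, raise an exception, or return a cached value.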

In a microservice architecture, a single request may consist of multiple RPCs that chain together multiple services. Timeouts can be set on a per-RPC basis, but how do you set a timeout for the entire request and not just a single RPC?

You need to implement deadlines, or distributed timeouts.

With a deadline, the initial request to a service sets the timeout for the entire request. We call this a deadline: the calling service is specifying that the entire request chain needs to respond by the deadline, or else the calling service will fall back to an alternative strategy (e.g., retry the request). As the request is passed from service to service, each service looks at the deadline, sets its own local timeout for processing, and passes the remainder of the deadline to the next service in the chain. We can show this approach here:

[Figure: deadlines graph]

As shown above, in order for deadlines to work, every service that is called needs to respect a common convention for deadlines. This involves receiving the deadline, determining a local timeout, passing the deadline on to the next service, and enforcing the local timeout.
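The convention described here (receive the deadline, determine a local timeout, pass the remainder on, enforce the timeout) could be sketched roughly as follows. This is one possible shape under stated assumptions, not a prescribed implementation: `Deadline`, `DeadlineExceeded`, `call_downstream`, and the `reserve_s` parameter are hypothetical names, and the sketch assumes the budget travels between services as a number of seconds remaining, so that no synchronized clocks are needed.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's time budget has run out."""


class Deadline:
    """Converts a received time budget (seconds) into a local expiry time."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left before the deadline; zero or negative once passed."""
        return self._expires_at - self._clock()

    def check(self):
        """Abort the current request if the deadline has passed."""
        if self.remaining() <= 0:
            raise DeadlineExceeded("request deadline exceeded")


def call_downstream(deadline, rpc, reserve_s=0.0):
    """Call rpc(budget_s) with the time that is left, minus a local reserve.

    reserve_s is the time this service keeps back for its own processing.
    rpc is expected to use budget_s both as its local timeout and as the
    deadline it forwards to the next service in the chain.
    """
    budget_s = deadline.remaining() - reserve_s
    if budget_s <= 0:
        raise DeadlineExceeded("no time left for the downstream call")
    return rpc(budget_s)
```

A service would build a `Deadline` from the budget it receives, call `check()` between units of local work, and route every outgoing RPC through `call_downstream` so that the next service only ever sees the remainder.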

Microservices and deadlines

In a microservices architecture, deadlines solve two problems.

First, setting local timeouts is hard and frequently arbitrary. When service A calls service B, how does the developer know if it should respond within 0.1 seconds or 0.5 seconds? The deadline model simplifies this problem by letting developers specify the timeout at a request level. Since these requests are usually customer facing, it’s more intuitive to specify that a user will want a response back within X timeframe, or a call to the REST API needs to return within Y timeframe.

Second, deadlines add resilience. Imagine that we want a request to respond to a user within two seconds. Suppose service A, which needs to call service B to respond to this request, accidentally calls service B in a loop, a hundred times instead of once. If service A uses a local timeout of 1.0 seconds per call, service A can take up to a hundred seconds to execute, since each of the hundred RPCs can take up to 1.0 seconds. This obviously would exceed the desired two-second threshold. With the deadline model, service A allocates itself a specific period of time that is less than the total two-second budget. If service A is unable to respond within its time allocation, it aborts instead of continuing with additional RPCs.
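The looping scenario can be illustrated with a short sketch. The function `call_b_in_loop`, its parameters, and the injectable `clock` are hypothetical, introduced only to make the arithmetic concrete; `call_b` stands in for the RPC to service B and receives the timeout it is allowed to use.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when service A has used up its share of the request budget."""


def call_b_in_loop(calls, budget_s, call_b, per_call_timeout_s=1.0,
                   clock=time.monotonic):
    """Call service B up to `calls` times without exceeding budget_s seconds.

    With only the local timeout, the worst case is
    calls * per_call_timeout_s. Checking the budget before every call caps
    the total at roughly budget_s instead.
    """
    expires_at = clock() + budget_s
    results = []
    for _ in range(calls):
        left = expires_at - clock()
        if left <= 0:
            # Abort rather than issue further RPCs.
            raise DeadlineExceeded(
                f"budget spent after {len(results)} call(s)")
        # Never give a single call more time than the budget has left.
        results.append(call_b(min(per_call_timeout_s, left)))
    return results
```

Assuming service A gives itself 1.5 seconds of the two-second budget and every call to B uses its full timeout, the loop stops after two calls (1.0 seconds, then 0.5 seconds) instead of running for a hundred seconds.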

Deadlines are a powerful tool to improve the resilience of a microservices architecture. By setting deadlines, you can improve the end user experience even in the face of network outages or software bugs.
