Graceful Request Retries in Ruby Applications

Failure Management, Fallbacks, Exponential backoff, Tools and Patterns

Jun 22, 2020

Designing modern applications on a microservices architecture, or systems built on cloud platforms such as AWS, Azure, or Google Cloud, implies the need to handle expected failures.

How to handle failures?

Retrying the code in the current thread at runtime
Retrying execution in background jobs

Retry failed code

The simplest runtime retry is a rescue block plus a variable that counts the retries already made.
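A minimal sketch of that idea, assuming a hypothetical ApiError class and a helper name (with_manual_retries) chosen only for illustration:

```ruby
# Hypothetical application-level error, used only for illustration.
class ApiError < StandardError; end

# Runs the block and re-runs it up to max_retries times when ApiError is raised.
def with_manual_retries(max_retries: 3)
  retries = 0
  loop do
    begin
      return yield
    rescue ApiError
      # Give up and re-raise the original error once the retry budget is spent.
      raise if retries >= max_retries
      retries += 1
    end
  end
end
```

The block runs at most max_retries + 1 times: one initial attempt plus the retries.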

Ruby also has a built-in retry keyword, which re-runs the begin block from the start, so the same logic can be written more compactly.
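The same counter-based logic could be written with the retry keyword roughly like this. ApiError and the helper name with_retry_keyword are assumptions made for the example:

```ruby
# Hypothetical application-level error, used only for illustration.
class ApiError < StandardError; end

# Same behaviour as a manual loop, but `retry` jumps back to `begin`.
def with_retry_keyword(max_retries: 3)
  retries = 0
  begin
    yield
  rescue ApiError
    # Re-raise the original error when no retries are left.
    raise if retries >= max_retries
    retries += 1
    retry
  end
end
```

Note that the counter is initialized outside the begin block. If it were inside, every retry would reset it and the loop would never terminate.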

Some best practices related to exceptions:

Any application-level error should inherit from StandardError, e.g. ApiError = Class.new(StandardError)
The errors to handle should be listed explicitly in rescue; at the very least, use rescue StandardError => e
Every retry of a code block should be logged together with its attributes and the retry count
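The practices above can be combined in one small helper. This is only a sketch: the error classes, the RETRYABLE_ERRORS list, and the log format are assumptions, not a prescribed convention.

```ruby
require "logger"
require "timeout"

# Hypothetical application-level errors, all descending from StandardError.
class ApiError < StandardError; end
class RateLimitError < ApiError; end

# Only these errors are retried; anything else propagates immediately.
RETRYABLE_ERRORS = [RateLimitError, Timeout::Error].freeze

def with_logged_retries(logger:, max_retries: 3, **attributes)
  retries = 0
  begin
    yield
  rescue *RETRYABLE_ERRORS => e
    raise if retries >= max_retries
    retries += 1
    # Log the error, the retry count, and caller-supplied attributes.
    logger.warn(
      "retrying error=#{e.class} message=#{e.message.inspect} " \
      "retry=#{retries}/#{max_retries} attributes=#{attributes}"
    )
    retry
  end
end
```

A call might look like with_logged_retries(logger: Logger.new($stdout), user_id: 42) { ... }, where user_id is an example attribute.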

Tools

Retriable is a simple DSL for retrying failed code blocks with randomized exponential backoff.

Code in a Retriable.retriable block will be retried if an exception is raised.

Defaults

By default, Retriable will:

Rescue any exception inherited from StandardError
Make 3 tries (including the initial attempt) before raising the exception
Use randomized exponential backoff to calculate each succeeding try interval

Exponential backoff is a common algorithm for retrying requests: the waiting time between retries grows exponentially, up to a certain threshold. Randomizing each interval spreads the retries out, so that a server that was temporarily down is not hit by all clients at the same moment when it comes back up.
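To make the calculation concrete, here is a standalone sketch of randomized exponential backoff. It is not the gem's implementation; the function name, parameter names, and default values are assumptions chosen for illustration.

```ruby
# Returns the wait time (in seconds) before each successive try.
# Each interval is base_interval * multiplier**attempt, capped at
# max_interval, then shifted randomly by up to +/- rand_factor of itself.
def backoff_intervals(tries:, base_interval: 0.5, multiplier: 1.5,
                      max_interval: 60.0, rand_factor: 0.5, rng: Random.new)
  Array.new(tries) do |attempt|
    interval = [base_interval * multiplier**attempt, max_interval].min
    delta = interval * rand_factor
    next interval if delta.zero? # no jitter requested

    rng.rand((interval - delta)..(interval + delta))
  end
end
```

With jitter disabled (rand_factor: 0) and the defaults above, the first five intervals are 0.5, 0.75, 1.125, 1.6875, and 2.53125 seconds.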

The gem also supports configurations for specific contexts: the number of retries and the list of exceptions can be set separately for internal APIs, for cloud services such as AWS, and so on.

A configured context is then used by calling Retriable.with_context with the context name.

Unfortunately, the gem doesn't provide an interface for fallbacks, so you have to implement them yourself.
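One possible way to add a fallback by hand is a small wrapper like the one below. The helper name with_fallback and its signature are assumptions; in practice the block passed to it would contain the retried call.

```ruby
# Runs the block; if it ultimately fails with one of the listed errors,
# returns the result of the fallback callable instead of raising.
def with_fallback(fallback, on: [StandardError])
  yield
rescue *on => e
  fallback.call(e)
end
```

For example, with_fallback(->(_error) { cached_value }) { fetch_from_api } would return a cached value once the retries are exhausted, where cached_value and fetch_from_api are placeholders for application code.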

Background Jobs

Redis-backed tools such as Sidekiq and Resque (the latter via the resque-retry plugin) let you configure how many times a job is retried after an exception.

Unfortunately, error handling in background jobs is global: a retry re-runs the whole job, not only the failed request. Try to move API requests into separate jobs that contain no unrelated logic, and set a retry limit for them. For example, by default Sidekiq retries a failed job 25 times (over about 21 days), which in most cases makes no sense when working with HTTP services.

Avoid in-process retries in code that runs inside background jobs: every job retry re-runs the inner retries, so the total number of attempts multiplies.

Error Handling

Most Ruby libraries for working with APIs provide a built-in retry interface tailored to the specific service; use it instead of writing your own wrapper.

A good example is aws-sdk-s3: each service error is handled at the API wrapper level.

Follow this approach when writing your API client for internal and external services.
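As an illustration of that approach for your own client, the sketch below keeps error classification and retries inside the wrapper. Everything here is hypothetical: the BillingClient name, the injected transport callable (returning a status and body pair), the status-code mapping, and the backoff values.

```ruby
# Hypothetical client for an internal service.
class BillingClient
  Error = Class.new(StandardError)
  TemporaryError = Class.new(Error) # worth retrying
  PermanentError = Class.new(Error) # retrying will not help

  # transport: callable taking a path and returning [status, body].
  # sleeper: callable used to wait between retries (injectable for tests).
  def initialize(transport:, max_retries: 2, sleeper: ->(seconds) { sleep(seconds) })
    @transport = transport
    @max_retries = max_retries
    @sleeper = sleeper
  end

  def get(path)
    retries = 0
    begin
      status, body = @transport.call(path)
      case status
      when 200..299 then body
      when 429, 500..599 then raise TemporaryError, "HTTP #{status} for #{path}"
      else raise PermanentError, "HTTP #{status} for #{path}"
      end
    rescue TemporaryError
      raise if retries >= @max_retries
      retries += 1
      @sleeper.call(0.1 * 2**retries) # simple exponential wait, no jitter
      retry
    end
  end
end
```

Callers then only deal with BillingClient::Error subclasses and never need to know which failures were retried.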

Conclusion

An application depends on numerous network components, such as DNS servers, switches, and load balancers, any of which can generate errors anywhere in the request lifecycle. So in modern application architecture, graceful retries with exponential backoff are a must when working with external or internal services.