Graceful Request Retries in Ruby Applications
Failure Management, Fallbacks, Exponential Backoff, Tools and Patterns
Designing modern applications on a microservices architecture, or on cloud platforms such as AWS, Azure, or Google Cloud, implies the need to handle expected failures.
How to handle failures?
- Retry the code at runtime, in the current thread
- Retry the execution in background jobs
Retry failed code
The simplest in-process retry is a rescue block plus a variable that counts the retries already made.
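A minimal sketch of that idea, using a loop, a rescue block, and a counter. The method name fetch_with_retries, the max_retries parameter, and the flaky lambda are made up for illustration.

```ruby
# Runs the block and re-runs it after a StandardError,
# until it succeeds or max_retries retries have been used.
def fetch_with_retries(max_retries: 3)
  retries = 0
  loop do
    begin
      return yield
    rescue StandardError
      raise if retries >= max_retries # give up: re-raise the last error
      retries += 1
    end
  end
end

# Hypothetical flaky call: fails twice, then succeeds.
calls = 0
flaky = lambda do
  calls += 1
  raise "temporary failure" if calls < 3
  "ok"
end

result = fetch_with_retries { flaky.call }
puts "#{result} after #{calls} calls"
```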
Ruby also has a built-in retry keyword, which re-runs the enclosing begin block from the start, so the counter-based example can be simplified with it.
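A sketch of a retry helper built on the retry keyword. The method name with_retry_keyword and its max_retries parameter are illustrative, not part of any library.

```ruby
# `retry` jumps back to the top of the `begin` block.
# The counter lives outside `begin`, so it is not reset on each pass.
def with_retry_keyword(max_retries: 3)
  retries = 0
  begin
    yield
  rescue StandardError
    retries += 1
    retry if retries <= max_retries
    raise # retries exhausted: re-raise the last error
  end
end

attempts = 0
value = with_retry_keyword do
  attempts += 1
  raise IOError, "connection reset" if attempts < 2
  :done
end
puts "#{value} after #{attempts} attempts"
```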
Some best practices related to exceptions:
- Any application-level error should be a child of StandardError, e.g. ApiError = Class.new(StandardError)
- The list of handled errors should be specified in rescue, or at least rescue StandardError => e
- Each retry of a code block should be logged with its attributes and the retry count
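The three practices above can be combined in one small helper. This is only a sketch: call_with_logged_retries, its parameters, and the log format are assumptions made for the example.

```ruby
require "logger"
require "stringio"
require "timeout"

# Application-level error, inheriting from StandardError.
ApiError = Class.new(StandardError)

# Retries only the listed errors and logs every retry with its context.
def call_with_logged_retries(logger, max_retries: 2, **attributes)
  retries = 0
  begin
    yield
  rescue ApiError, Timeout::Error => e # explicit list of handled errors
    raise if retries >= max_retries
    retries += 1
    logger.warn("retrying: error=#{e.class} message=#{e.message} " \
                "retry=#{retries}/#{max_retries} attributes=#{attributes}")
    retry
  end
end

log_output = StringIO.new
logger = Logger.new(log_output)
attempts = 0
status = call_with_logged_retries(logger, endpoint: "/orders") do
  attempts += 1
  raise ApiError, "503 from upstream" if attempts == 1
  :ok
end
puts status
puts log_output.string
```

Errors outside the list (for example ArgumentError) are not retried and propagate immediately, which keeps programming mistakes from being hidden by the retry loop.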
Retriable is a simple DSL for retrying failed code blocks with randomized exponential backoff.
Code in a Retriable.retriable block will be retried if an exception is raised. By default, Retriable will:
- Rescue any exception inherited from StandardError
- Make 3 tries (including the initial attempt) before raising the exception
- Use randomized exponential backoff to calculate each succeeding try interval
Exponential backoff is a common algorithm for retrying requests. The waiting time between retries grows exponentially, up to a certain threshold. The idea is that if the server is temporarily down, it is not flooded with requests that all arrive at the same moment when it comes back up.
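To make the algorithm concrete, here is a sketch of how a randomized exponential interval can be computed. The helper backoff_interval and its default values are illustrative assumptions, not the internals of any particular gem.

```ruby
# Returns the wait in seconds before retry number `attempt` (1-based).
# The base interval grows by `multiplier` on each retry, is capped at
# `max_interval`, and is then spread randomly by +/- `rand_factor`.
def backoff_interval(attempt, base: 0.5, multiplier: 1.5,
                     max_interval: 60.0, rand_factor: 0.5)
  interval = [base * (multiplier**(attempt - 1)), max_interval].min
  jitter = interval * rand_factor
  rand((interval - jitter)..(interval + jitter))
end

(1..5).each do |attempt|
  puts format("retry %d: wait %.2fs", attempt, backoff_interval(attempt))
end
```

The random spread is what keeps many clients that failed at the same moment from retrying in lockstep.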
The gem also supports configuration for a specific context: the number of retries and the list of exceptions can be set separately for internal APIs, cloud services such as AWS, and so on.
These are used simply by calling Retriable.with_context.
Unfortunately, the gem doesn’t provide an interface for fallbacks, so you have to implement them yourself.
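One possible shape for such a fallback, sketched in plain Ruby so it runs without the gem. The helper with_fallback and its fallback:, tries:, and on: parameters are hypothetical; in a real application the inner retry loop could be replaced by a retry block from the gem.

```ruby
# Retries the block up to `tries` times, then returns the fallback
# (a value, or the result of a callable) instead of raising.
def with_fallback(fallback:, tries: 3, on: [StandardError])
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *on
    retry if attempts < tries
    fallback.respond_to?(:call) ? fallback.call : fallback
  end
end

calls = 0
price = with_fallback(fallback: -> { "cached price" }, tries: 2) do
  calls += 1
  raise IOError, "pricing service unavailable"
end
puts "#{price} (after #{calls} tries)"
```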
Redis-backed tools such as Sidekiq and Resque provide configuration for the number of times a job is retried after an exception.
Unfortunately, error handling in background jobs is global, so a retry may be triggered by something other than the request. Try to move API requests into separate jobs that are not tied to other logic, and set a retry limit for them. For example, Sidekiq makes 25 retries for a failed job (over about 21 days), which in most cases doesn’t make sense when working with HTTP services.
Do not add in-process retries to code that runs in background jobs, because the two retry counts multiply.
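The figures above can be checked with a little arithmetic. This sketch assumes that Sidekiq's default delay before retry number count (0-based) is roughly count**4 + 15 seconds; the small random jitter Sidekiq adds on top is ignored, and approx_sidekiq_delay is a made-up helper name.

```ruby
# Assumption: approximate default Sidekiq delay, without the random jitter.
def approx_sidekiq_delay(count)
  count**4 + 15
end

total_seconds = (0...25).sum { |count| approx_sidekiq_delay(count) }
days = total_seconds / 86_400.0
puts format("25 retries span about %.1f days", days)

# If the job body also retries in-process, the attempts multiply.
in_process_tries = 3
job_executions = 1 + 25 # first run plus 25 job-level retries
worst_case_requests = in_process_tries * job_executions
puts "worst case: #{worst_case_requests} HTTP requests for one job"
```

Without jitter the total comes out slightly under 21 days, which is why a much lower retry limit usually suits jobs that only wrap an HTTP call.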
Most Ruby tools for working with APIs provide a built-in retry interface tailored to the specific service; use it instead of writing your own wrapper.
Let’s look at a good example: aws-sdk-s3. Each service error is handled at the API wrapper level.
Follow this approach when writing your API client for internal and external services.
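As a sketch of what handling errors at the wrapper level can look like in your own client: InventoryClient, its error classes, and the injected transport callable (which returns a status and a body) are all hypothetical. A production client would also sleep with backoff between tries.

```ruby
# Hypothetical client for an internal service. `transport` is any callable
# that performs the request and returns [status, body].
class InventoryClient
  Error = Class.new(StandardError)
  ServerError = Class.new(Error)

  def initialize(transport, max_tries: 3)
    @transport = transport
    @max_tries = max_tries
  end

  # Callers see either a body or an InventoryClient::Error;
  # retrying of server errors stays inside the client.
  def get(path)
    tries = 0
    begin
      tries += 1
      status, body = @transport.call(path)
      raise ServerError, "HTTP #{status} for #{path}" if status >= 500
      body
    rescue ServerError
      retry if tries < @max_tries
      raise
    end
  end
end

# Stubbed transport: one 503, then a successful response.
responses = [[503, nil], [200, "12 in stock"]]
client = InventoryClient.new(->(_path) { responses.shift })
body = client.get("/items/42")
puts body
```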
An application depends on numerous network components, such as DNS servers, switches, and load balancers, any of which can generate errors anywhere in the request lifecycle. So, in modern application architecture, graceful retries with exponential backoff are a must when working with external or internal services.