Innovative Resilience: Utilizing Exponential Backoff and Jitter for Retries under Load
This section delves into the intelligent use of retries with exponential backoff and jitter to handle network failures gracefully without overwhelming backend services. It offers practical patterns, real-world examples, and a comprehensive understanding of these techniques.
Innovative Resilience: Utilizing Exponential Backoff and Jitter for Retries under Load
In distributed systems, failure scenarios such as timeouts, HTTP 503 Service Unavailable responses, dropped packets, and throttled endpoints are inevitable, particularly as the system scales. The instinctive reaction might be to immediately attempt a retry. However, a more strategic approach involves careful retries, progressively increasing delay between these retries, and introducing a degree of randomness into the process.
The strategy of exponential backoff combined with jitter sets the stage for this careful and strategic handling of failures. It enables clients to retry failed requests in a gradual manner, effectively avoiding overwhelming an already strained service with a retry storm.
This deep dive will explore:
- The dangers of naive retries
- The concept of exponential backoff
- The role of jitter in preventing overload
- Practical JavaScript patterns for implementing retries
- Real-world examples from tech giants like Google, AWS, and Stripe
The Logic of Retries
In certain situations, retrying is the most logical approach:
- A flaky mobile network leading to a timeout
- A server temporarily returning HTTP 500 Internal Server Error or HTTP 429 Too Many Requests
- A dropped connection during a fetch operation
- A momentarily full queue
However, immediate or simultaneous retries from multiple clients, also known as a thundering herd, can lead to cascading failures.
The Art of Exponential Backoff
Exponential backoff is a strategy that involves progressively increasing the wait time between retries. More specifically, it increases this delay geometrically, as demonstrated in the following JavaScript function:
const delay = (attempt) => Math.pow(2, attempt) * 100;
The retry timeline with exponential backoff would look something like this:
- Retry 1 → wait 100ms
- Retry 2 → wait 200ms
- Retry 3 → wait 400ms
- Retry 4 → wait 800ms
By implementing exponential backoff, you provide your servers with the necessary breathing room to recover, thereby reducing the retry load.
Enhancing Safety with Jitter
Despite the benefits of exponential backoff, if all clients retry at the same intervals, the system can still be overloaded. This is where jitter comes into play, adding a level of randomness to the retry delays:
function backoffWithJitter(attempt) {
const base = Math.pow(2, attempt) * 100;
return base / 2 + Math.random() * (base / 2);
}
Jitter effectively disperses the retry waves, resulting in smoother traffic flow. There are several common jitter modes:
- Full jitter:
Math.random() * base
- Equal jitter:
base/2 + Math.random() * base/2
- Decorrelated jitter:
min(cap, random_between(base, prev_delay * 3))
(used by AWS)
Practical Retry Pattern in JavaScript
Here's an example of a practical JavaScript function that implements retries using exponential backoff and jitter, and can be used for GET requests, polling operations, and load-on-mount queries:
async function fetchWithRetry(url, retries = 5) {
for (let attempt = 0; attempt < retries; attempt++) {
try {
const res = await fetch(url);
if (!res.ok) throw new Error("Failed");
return await res.json();
} catch (e) {
const delay = backoffWithJitter(attempt);
await new Promise((r) => setTimeout(r, delay));
}
}
throw new Error("Retries exhausted");
}
Available Retry Libraries
There are several libraries that provide built-in support for retries with backoff and/or jitter:
- retry (npm): This library provides retry decorators and policies.
- axios-retry: This library plugs into Axios to provide retry functionality.
- React Query: This library includes built-in support for retries with backoff.
- SWR: This library supports retries with exponential delay through its
onErrorRetry
function.
When Retries Should Be Avoided
Retries should ideally be avoided for:
- Mutations with side effects such as payments or sending data.
- User-generated form submissions without providing user feedback.
- Hard failures such as HTTP 401 Unauthorized, 403 Forbidden, and 404 Not Found; these are not likely to be resolved by simply retrying.
In essence, retries should only be used when the failure is likely to be transient.
Real-World Examples of Retry Strategies
AWS SDK
The AWS SDK employs capped exponential backoff combined with jitter for retries. It also provides configurable retry policies for every SDK.
Stripe
Stripe retries HTTP 429 and 500 errors with increasing delays. However, their documentation also provides warnings about which endpoints are safe or unsafe to retry.
Google APIs
Google APIs retry HTTP 5xx errors and rate limit errors using full jitter. They also provide Retry-After
headers to control backoff.
Retry Anti-Patterns
Certain practices should be avoided when implementing retries:
- Retrying without delay, as this can overload the backend.
- Using a fixed delay for retries, as it can lead to synchronization storms.
- Retrying indefinitely, as it can cause cascading failures.
- Ignoring
Retry-After
headers, which are intended to control retries. - Failing silently after retries, without providing any feedback.
Conclusion: Fail Slowly, Recover Gracefully
Retrying isn't just a fallback mechanism — it's a well-thought-out strategy. It's about acknowledging the fact that failures do occur and planning for those failures in a way that doesn't contribute to the chaos. Exponential backoff ensures that retries are conducted in a careful and measured manner, while jitter introduces randomness to prevent synchronization issues. Together, they enable your frontend to be a responsible participant in a distributed environment.
In the face of adversity, the way you choose to retry can make a significant difference.
Subscribe for Deep Dives
Get exclusive in-depth technical articles and insights directly to your inbox.
Related Posts
Timeouts — Mastering Network Resilience and Responsiveness
In this article, we delve into the significance of timeout strategies in frontend networking, exploring how to efficiently cancel slow requests, provide early error feedback, and construct more robust and responsive web applications.
Delving into Server-Sent Events (SSE): Unidirectional Data Streaming Made Simple
A comprehensive exploration of Server-Sent Events (SSE) — a straightforward approach to real-time data transmission from server to browser over HTTP. We'll explore how it operates, its ideal use cases, and how it stands against other technologies like WebSockets and polling.
RESTful APIs: A Deep Dive into the Backbone of Web Data Exchange
An in-depth exploration of RESTful APIs, their importance, operational mechanism, and how modern frontend applications optimally consume them. Discover conventions, industry applications, caching, pagination, and prime practices for scalable API design.