Innovative Resilience: Utilizing Exponential Backoff and Jitter for Retries under Load

In distributed systems, failure scenarios such as timeouts, HTTP 503 Service Unavailable responses, dropped packets, and throttled endpoints are inevitable, particularly as the system scales. The instinctive reaction might be to immediately attempt a retry. However, a more strategic approach involves careful retries, progressively increasing delay between these retries, and introducing a degree of randomness into the process.

The strategy of exponential backoff combined with jitter sets the stage for this careful and strategic handling of failures. It enables clients to retry failed requests in a gradual manner, effectively avoiding overwhelming an already strained service with a retry storm.

This deep dive will explore:

The dangers of naive retries
The concept of exponential backoff
The role of jitter in preventing overload
Practical JavaScript patterns for implementing retries
Real-world examples from tech giants like Google, AWS, and Stripe

The Logic of Retries

In certain situations, retrying is the most logical approach:

A flaky mobile network leading to a timeout
A server temporarily returning HTTP 500 Internal Server Error or HTTP 429 Too Many Requests
A dropped connection during a fetch operation
A momentarily full queue

However, immediate or simultaneous retries from multiple clients, also known as a thundering herd, can lead to cascading failures.

The Art of Exponential Backoff

Exponential backoff is a strategy that involves progressively increasing the wait time between retries. More specifically, it increases this delay geometrically, as demonstrated in the following JavaScript function:

const delay = (attempt) => Math.pow(2, attempt) * 100;

The retry timeline with exponential backoff would look something like this:

Retry 1 → wait 100ms
Retry 2 → wait 200ms
Retry 3 → wait 400ms
Retry 4 → wait 800ms

By implementing exponential backoff, you provide your servers with the necessary breathing room to recover, thereby reducing the retry load.

Enhancing Safety with Jitter

Despite the benefits of exponential backoff, if all clients retry at the same intervals, the system can still be overloaded. This is where jitter comes into play, adding a level of randomness to the retry delays:

function backoffWithJitter(attempt) {
  const base = Math.pow(2, attempt) * 100;
  return base / 2 + Math.random() * (base / 2);
}

Jitter effectively disperses the retry waves, resulting in smoother traffic flow. There are several common jitter modes:

Full jitter: Math.random() * base
Equal jitter: base/2 + Math.random() * base/2
Decorrelated jitter: min(cap, random_between(base, prev_delay * 3)) (used by AWS)

Practical Retry Pattern in JavaScript

Here's an example of a practical JavaScript function that implements retries using exponential backoff and jitter, and can be used for GET requests, polling operations, and load-on-mount queries:

async function fetchWithRetry(url, retries = 5) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error("Failed");
      return await res.json();
    } catch (e) {
      const delay = backoffWithJitter(attempt);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("Retries exhausted");
}

Available Retry Libraries

There are several libraries that provide built-in support for retries with backoff and/or jitter:

retry (npm): This library provides retry decorators and policies.
axios-retry: This library plugs into Axios to provide retry functionality.
React Query: This library includes built-in support for retries with backoff.
SWR: This library supports retries with exponential delay through its onErrorRetry function.

When Retries Should Be Avoided

Retries should ideally be avoided for:

Mutations with side effects such as payments or sending data.
User-generated form submissions without providing user feedback.
Hard failures such as HTTP 401 Unauthorized, 403 Forbidden, and 404 Not Found; these are not likely to be resolved by simply retrying.

In essence, retries should only be used when the failure is likely to be transient.

Real-World Examples of Retry Strategies

AWS SDK

The AWS SDK employs capped exponential backoff combined with jitter for retries. It also provides configurable retry policies for every SDK.

Stripe

Stripe retries HTTP 429 and 500 errors with increasing delays. However, their documentation also provides warnings about which endpoints are safe or unsafe to retry.

Google APIs

Google APIs retry HTTP 5xx errors and rate limit errors using full jitter. They also provide Retry-After headers to control backoff.

Retry Anti-Patterns

Certain practices should be avoided when implementing retries:

Retrying without delay, as this can overload the backend.
Using a fixed delay for retries, as it can lead to synchronization storms.
Retrying indefinitely, as it can cause cascading failures.
Ignoring Retry-After headers, which are intended to control retries.
Failing silently after retries, without providing any feedback.

Conclusion: Fail Slowly, Recover Gracefully

Retrying isn't just a fallback mechanism — it's a well-thought-out strategy. It's about acknowledging the fact that failures do occur and planning for those failures in a way that doesn't contribute to the chaos. Exponential backoff ensures that retries are conducted in a careful and measured manner, while jitter introduces randomness to prevent synchronization issues. Together, they enable your frontend to be a responsible participant in a distributed environment.

In the face of adversity, the way you choose to retry can make a significant difference.

Innovative Resilience: Utilizing Exponential Backoff and Jitter for Retries under Load

Innovative Resilience: Utilizing Exponential Backoff and Jitter for Retries under Load

The Logic of Retries

The Art of Exponential Backoff

Enhancing Safety with Jitter

Practical Retry Pattern in JavaScript

Available Retry Libraries

When Retries Should Be Avoided

Real-World Examples of Retry Strategies

AWS SDK

Stripe

Google APIs

Retry Anti-Patterns

Conclusion: Fail Slowly, Recover Gracefully

Subscribe for Deep Dives

Related Posts

Timeouts — Mastering Network Resilience and Responsiveness

Delving into Server-Sent Events (SSE): Unidirectional Data Streaming Made Simple

RESTful APIs: A Deep Dive into the Backbone of Web Data Exchange