20 February 2014

Failure resiliency and retry delay planning

When dealing with networked or external data sources I've learned the hard way that all code should be designed as to expect failures. It is significantly cheaper and easier to bake a graceful handling of errors this from the start rather than attempt to do it later on. A common first step in combating failures and providing resiliency is to have your logic retry an operation in case of a failure.

But how should you retry?

Simple retries

The most simplistic retry mode is to simply surround all your code with a while loop that executes your block a predefined number of times, ala:

int retry = 0;
do
{
   // Operation
   if( true == MyFlakyOperation() )
       break;
}
while ( ++retry < 6 )

The problem with this approach is that it completely ignores the most likely underlying reason for the failure. Congestion or resource load on the remote end could be causing your calls (and many others) to intermittently fail as the server cannot handle the incoming requests. In this case your naive implementation might actually be contributing to making the situation even worse.

So how do we solve this?

Spacing out retries

One common approach to spacing out retries is called Exponential backoff. This algorithm uses a predefined feedback (e.g. retry count) to systematically increase wait times between repeated executions of the same code to avoid congestion.

Example of exponential spacing based on 4sec base wait time.
The vertical bars indicate retry points.
The idea is that with every additional retry that is required it is more likely that the system we're communicating with is heavily congested and needs more "breathing" space to resolve its current backlog of requests.

Example of backoff retries

Below is an example of a very simple C++ algorithm snippet that performs this kind of exponential backoff based on 4sec intervals:

int success = OK;
int retry = 0;
do
{
   // Operation
   success = MyFlakyOperation();

   // Sleep if operation was not success
   if (success != OK)
   {
       int sec = static_cast(std::pow(4, retry));
       std::this_thread::sleep_for(std::chrono::seconds(sec));
   } 
}
while ( ++retry < 6 && success != OK)

In this example my algorithm has a maximum running time with full retry count of a whooping 22 min and 44 seconds! (4+16+64+256+1024 = 1364sec).

How much does the waiting time increase?

Care must be taken when choosing the interval to increment by when using a naive approach as my example above. Below is a table listing the waiting times in seconds for each retry for 2-7 second intervals.

Remember that your maximum running time is the cumulative waiting numbers for all  intervals!

Retry#2sec3sec4sec5sec6sec7sec
1
2
3
4
5
6
7
2
4
9
16
25
36
49
3
8
27
64
125
216
343
4
16
81
256
625
1,296
2,401
5
32
243
1,024
3,125
7,776
16,807
6
64
729
4,096
15,625
46,656
117,649
7
128
2,187
16,384
78,125
279,936
823,543
8
256
6,561
65,536
390,625
1,679,616
5,764,801
9
512
19,683
262,144
1,953,125
10,077,696
40,353,607
10
1,024
59,049
1,048,576
9,765,625
60,466,176
282,475,249

* so using 7 sec as a base and allowing up to 10 retries, the total maximum waiting time will be just shy of 10,5 years!