Harðkjarni: February 2014

When dealing with networked or external data sources I've learned the hard way that all code should be designed as to expect failures. It is significantly cheaper and easier to bake a graceful handling of errors this from the start rather than attempt to do it later on. A common first step in combating failures and providing resiliency is to have your logic retry an operation in case of a failure.

But how should you retry?

Simple retries

The most simplistic retry mode is to simply surround all your code with a while loop that executes your block a predefined number of times, ala:

int retry = 0;

do
{
   // Operation
   if( true == MyFlakyOperation() )

       break;
}

while ( ++retry < 6 )

The problem with this approach is that it completely ignores the most likely underlying reason for the failure. Congestion or resource load on the remote end could be causing your calls (and many others) to intermittently fail as the server cannot handle the incoming requests. In this case your naive implementation might actually be contributing to making the situation even worse.

So how do we solve this?

Spacing out retries

One common approach to spacing out retries is called Exponential backoff. This algorithm uses a predefined feedback (e.g. retry count) to systematically increase wait times between repeated executions of the same code to avoid congestion.

Example of exponential spacing based on 4sec base wait time.
The vertical bars indicate retry points.

The idea is that with every additional retry that is required it is more likely that the system we're communicating with is heavily congested and needs more "breathing" space to resolve its current backlog of requests.

Example of backoff retries

Below is an example of a very simple C++ algorithm snippet that performs this kind of exponential backoff based on 4sec intervals:

int success = OK;
int retry = 0;
do
{
   // Operation
   success = MyFlakyOperation();

   // Sleep if operation was not success
   if (success != OK)
   {
       int sec = static_cast(std::pow(4, retry));
       std::this_thread::sleep_for(std::chrono::seconds(sec));
   } 
}

while ( ++retry < 6 && success != OK)

In this example my algorithm has a maximum running time with full retry count of a whooping 22 min and 44 seconds! (4+16+64+256+1024 = 1364sec).

How much does the waiting time increase?

Care must be taken when choosing the interval to increment by when using a naive approach as my example above. Below is a table listing the waiting times in seconds for each retry for 2-7 second intervals.

Remember that your maximum running time is the cumulative waiting numbers for all intervals!

Retry#	2sec	3sec	4sec	5sec	6sec	7sec
1	2	3	4	5	6	7
2	4	9	16	25	36	49
3	8	27	64	125	216	343
4	16	81	256	625	1,296	2,401
5	32	243	1,024	3,125	7,776	16,807
6	64	729	4,096	15,625	46,656	117,649
7	128	2,187	16,384	78,125	279,936	823,543
8	256	6,561	65,536	390,625	1,679,616	5,764,801
9	512	19,683	262,144	1,953,125	10,077,696	40,353,607
10	1,024	59,049	1,048,576	9,765,625	60,466,176	282,475,249

* so using 7 sec as a base and allowing up to 10 retries, the total maximum waiting time will be just shy of 10,5 years!

20 February 2014

Failure resiliency and retry delay planning

Simple retries

Spacing out retries

Example of backoff retries

How much does the waiting time increase?