How long do real-world clients wait before retrying? Is there a general guidance ...

benlivengood · on Oct 27, 2020

1. Look at the latency of historical successful commits and your success_percentage. Take the latency at the success_percentage-th percentile and call it max_success_latency. This should be bounded closely from below by the longest round-trip time between nodes. If it's not, it's worth fixing.

2. Look at your external SLO and get target_latency and target_success_percentage from your thresholds.

3. Retry on failure. Retry on timeout as late as target_latency - max_success_latency and optimistically as early as max_success_latency. The wiggle room between those two gives you a helpful idea of how close you are to breaking SLO. The earlier you retry, the more likely you are to overload the backend if it slows down due to load. The later you retry, the more likely you are to break SLO under load. Use a rate-limiting back-off strategy in clients to avoid overloading the backend completely. Probabilistic rate-limiting to the observed success rate (plus a little) on each client works pretty well.

4. Provision your Raft/Paxos for (1 + (1 - success_ratio))^max_retries times the maximum expected traffic to account for the load from retries.
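Steps 1 to 4 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the nearest-rank percentile method, and the sample numbers are made up for the example; only the formulas come from the steps above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def retry_window(max_success_latency, target_latency):
    """Step 3: (earliest, latest) point at which to retry after a timeout."""
    earliest = max_success_latency
    latest = target_latency - max_success_latency
    return earliest, latest

def provisioning_factor(success_ratio, max_retries):
    """Step 4: capacity multiplier, (1 + (1 - success_ratio))^max_retries."""
    return (1 + (1 - success_ratio)) ** max_retries

# Hypothetical measurements, in milliseconds (not from the thread).
commit_latencies_ms = [12, 15, 14, 18, 22, 13, 40, 16, 17, 19]
success_percentage = 90
target_latency_ms = 100

max_success_latency = percentile(commit_latencies_ms, success_percentage)
earliest, latest = retry_window(max_success_latency, target_latency_ms)
factor = provisioning_factor(success_percentage / 100, max_retries=3)

print(f"max_success_latency = {max_success_latency} ms")
print(f"retry between {earliest} ms and {latest} ms after sending")
print(f"provision for {factor:.3f}x the maximum expected traffic")
```

The gap between `earliest` and `latest` is the wiggle room from step 3; if it shrinks toward zero, you are close to breaking SLO.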

Note that if (max_success_latency * 2) > target_latency AND success_percentage < target_success_percentage then you will need optimistic retries which can put quite a lot of load on the backend and even that may not keep you within SLO; it mostly depends on whether failures/timeouts are independent or data-dependent.
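The condition in this note can be written as a small check. When max_success_latency * 2 exceeds target_latency, the latest useful retry time (target_latency - max_success_latency) falls before the earliest one (max_success_latency), so a retry that waits for the timeout cannot finish in time. The function name and example numbers below are hypothetical, chosen only to illustrate the condition.

```python
def needs_optimistic_retries(max_success_latency, target_latency,
                             success_percentage, target_success_percentage):
    """True when waiting for a timeout before retrying cannot meet the SLO."""
    # Two back-to-back attempts do not fit inside the latency target.
    no_room_for_late_retry = max_success_latency * 2 > target_latency
    # A single attempt alone does not succeed often enough.
    below_target = success_percentage < target_success_percentage
    return no_room_for_late_retry and below_target

# Hypothetical numbers: latencies in milliseconds, success rates in percent.
print(needs_optimistic_retries(60, 100, 99.0, 99.9))  # two attempts need 120 ms
print(needs_optimistic_retries(22, 100, 99.0, 99.9))  # two attempts need 44 ms
```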

k2xl · on Oct 28, 2020

Fantastic answer. Thank you

pacaro · on Oct 27, 2020

I'm sorry to be difficult, but my late step-father would respond to questions like this with "how long is a piece of string?"

There is no useful direct answer to a question like this.

Having said that, what are your normal communication latencies between nodes at P50, P95, and P99? What are those numbers when the system is under heavy load?

You can model the impact that different retry intervals will have on your system.
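One very rough way to start such a model is to replay measured latencies against candidate retry intervals. The sketch below makes strong simplifying assumptions: every attempt is abandoned after retry_interval, attempts are independent draws from the samples (which will not hold if failures are data-dependent or load-dependent), and a failed attempt is recorded as infinite latency. All names and numbers are hypothetical.

```python
def fraction_within(latencies, limit):
    """Fraction of sampled attempts that finish within `limit`."""
    return sum(1 for x in latencies if x <= limit) / len(latencies)

def model_retry_interval(latencies, retry_interval, target_latency, max_retries):
    """Return (probability of finishing within target_latency,
    expected number of attempts sent per request)."""
    p_all_missed = 1.0      # probability that every attempt so far missed
    p_reach = 1.0           # probability that this attempt is sent at all
    expected_attempts = 0.0
    for attempt in range(max_retries + 1):
        start = attempt * retry_interval
        if start >= target_latency:
            break           # too late for this attempt to help
        expected_attempts += p_reach
        # Time this attempt has before it is abandoned or the target passes.
        budget = min(retry_interval, target_latency - start)
        p_all_missed *= 1 - fraction_within(latencies, budget)
        p_reach *= 1 - fraction_within(latencies, retry_interval)
    return 1 - p_all_missed, expected_attempts

# Hypothetical samples in milliseconds; inf marks an attempt that never succeeded.
samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, float("inf")]
for interval in (30, 50, 90):
    p_ok, attempts = model_retry_interval(samples, interval, 100, max_retries=2)
    print(f"retry every {interval} ms: P(within target)={p_ok:.3f}, "
          f"attempts per request={attempts:.2f}")
```

Running the same function over samples taken under heavy load shows how both numbers shift when the backend slows down.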