Occasionally, users choose a shorter timeout for REST publish operations (ie shorter than the default of 10s for the entire request) in order to get quicker feedback on a failed publish, so that a retry can be initiated sooner, or to reduce the time that worker processes are blocked on publish requests. This note explains the issues to consider before making such a change.
REST message publish sequence in Ably
When a frontend receives a request to publish a message via the REST API, the following steps occur:
- request credentials are validated, and checked for compatibility with the attempted operation;
- rate limits are checked for the REST API operation;
- the request body is decoded and the messages within it undergo simple validity checks (eg payload size, connectionId, clientId validity);
- the target channel is activated, if it is not already active in that region;
- the message is forwarded to the core message processing layer;
- rate limits are checked for the channel;
- the message is checked for existence, persisted and indexed;
- successful processing of the message is indicated to the frontend;
- the HTTP request completes, with success or failure indicated to the client.
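The sequence above can be sketched as an ordered series of checks in which the response is produced only once the outcome of every step is known. This is a minimal illustration, not Ably's implementation: the step functions, the request fields (`key`, `data`, `over_channel_limit`) and the 1024-byte limit are all hypothetical and stand in for a few of the steps listed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each step inspects the request and returns an error string, or None on success.
Step = Tuple[str, Callable[[Dict], Optional[str]]]


def check_credentials(request: Dict) -> Optional[str]:
    # Hypothetical check: any non-empty key counts as valid here.
    return None if request.get("key") else "invalid credentials"


def check_payload(request: Dict) -> Optional[str]:
    # 1024 bytes is an illustrative limit, not a real service limit.
    return None if len(request.get("data", "")) <= 1024 else "payload too large"


def check_channel_rate(request: Dict) -> Optional[str]:
    # Hypothetical flag standing in for a per-channel rate limit decision.
    return "channel rate limit exceeded" if request.get("over_channel_limit") else None


STEPS: List[Step] = [
    ("credentials", check_credentials),
    ("payload", check_payload),
    ("channel rate limit", check_channel_rate),
]


def handle_publish(request: Dict, steps: List[Step] = STEPS) -> Dict:
    """Run every step in order and respond only once the outcome is known."""
    for name, step in steps:
        error = step(request)
        if error is not None:
            # A late step (eg the channel rate limit) can still reject the publish,
            # which is why the response cannot be sent before it has run.
            return {"ok": False, "failed_step": name, "error": error}
    return {"ok": True, "failed_step": None, "error": None}
```

A client that stops waiting before the last step has run never sees a rejection raised by that step.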
We perform all of these steps before completing the publish request so that any of the legitimate failure or rejection conditions can be reflected in the response. In principle, we could complete the publish before all of these steps are complete, but that would mean that legitimate failures would not be indicated to the client.
Several of these steps can introduce latency, whether through transit latencies (especially channel activation in a multi-region environment), network queueing latencies, or processing queueing latencies. In our architecture, channel operations on an already-activated channel are designed to be as deterministic as possible. However, the channel activation step, in which channel resources are allocated, is inherently less deterministic and can be a significant fraction of the overall publish time if a channel is not already active. We try to keep these latencies as small as possible, but there is inevitably some variance, and all of these sources of latency grow as system load increases.
Latency in Ably also increases when the cluster configuration is changing, for example during a scaling operation. The cluster is elastic and scales on demand, but latencies increase during scaling operations because changes in configuration take time to be consistently agreed across the cluster. The latency experienced by the publisher also includes TLS handshake and network latencies, in addition to the request latency at Ably.
Impact of short timeouts on Ably
This statistical variation means that some fraction of publish requests will take longer than a short timeout interval (say, less than 1s). In the vast majority of cases, a request that misses the deadline will still succeed, just with higher than average latency. If the client retries the publish, in almost every situation this will not increase the likelihood of the message being published, nor will it result in subscribers receiving the message more quickly. In fact, when the system is under load and queueing latencies increase, a larger fraction of requests will exceed the timeout, and the resulting retries will just create more load on the system (and on any client worker processes making the requests), worsening the situation. There is a significant likelihood of a cascade effect, with retries, load and latency all increasing continuously until a limit is hit (eg the request capacity of the worker pool, or a global account request rate limit in Ably). This load feedback effect would be particularly problematic during scaling operations. Typically we autoscale when the load reaches a threshold; if scaling causes a temporary increase in latency, and that in turn triggers a significant increase in load through retries, the result could be a significant problem.
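To put a rough number on the extra load from retries, the sketch below computes the expected number of requests sent per message when every timed-out attempt is retried. It assumes, purely for illustration, that each attempt exceeds the timeout independently with the same probability; the function name and parameters are hypothetical.

```python
def load_multiplier(p_timeout: float, max_retries: int) -> float:
    """Expected requests sent per message when every timed-out attempt is retried.

    Assumes each attempt exceeds the timeout independently with probability
    p_timeout, and that the client gives up after max_retries retries.
    This is the partial geometric series 1 + p + p^2 + ... + p^max_retries.
    """
    if not 0.0 <= p_timeout <= 1.0:
        raise ValueError("p_timeout must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(p_timeout ** k for k in range(max_retries + 1))


# With no timeouts there is exactly one request per message.
baseline = load_multiplier(0.0, 3)

# If half of all attempts time out and up to two retries are allowed,
# the service sees 1 + 0.5 + 0.25 requests per message.
under_load = load_multiplier(0.5, 2)
```

In practice the timeout probability is not fixed: it rises as load rises, so the multiplier itself grows under load, which is the feedback effect described above.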
Impact of short timeouts on publisher
The other adverse impact of short timeouts is that, if requests do fail for a legitimate reason (eg a rate limit or channel limit), the publisher will not be notified of the problem. For these kinds of error conditions, repeated publish attempts will likely fail for the same reason as the first attempt.
Sometimes the concern is the impact on the utilisation of a worker thread pool when many threads are idle, blocked on an HTTP request with a long timeout. That impact is a function of the mean request duration, not the peak duration; when the vast majority of requests complete in a very short time, the impact of occasional long requests on overall worker pool utilisation is minimal. Worker pool utilisation is also affected by whatever action is taken as a result of a timeout. If a timeout simply results in the request being retried, the associated workers will not be idle, but equally they will not be generating any greater productive output: the overall workload increases, generating repeat requests that most likely achieve nothing, because the initial publish request has since succeeded.
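The dependence on mean rather than peak duration can be checked with a small calculation based on Little's law (average busy workers = request rate x mean request duration). The request rate, durations and fractions below are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def busy_workers(requests_per_s: float, mix: Iterable[Tuple[float, float]]) -> float:
    """Average number of workers blocked on requests, by Little's law.

    mix is a sequence of (duration_in_seconds, fraction_of_requests) pairs
    whose fractions are expected to sum to 1.
    """
    mean_duration = sum(duration * fraction for duration, fraction in mix)
    return requests_per_s * mean_duration


# Illustrative workload: 100 publishes per second.
all_fast = busy_workers(100, [(0.05, 1.0)])

# Same workload, but 0.1% of requests take 2s instead of 50ms.
with_slow_tail = busy_workers(100, [(0.05, 0.999), (2.0, 0.001)])
```

Under these assumed numbers the slow tail raises the average number of busy workers from 5 to roughly 5.2, even though the slowest requests take 40 times longer than the typical one.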
Timely detection and correction of publisher connection failures
The one situation in which it is legitimate to retry a request is if the connection to the Ably endpoint fails, or takes a long time to become established. Retrying promptly minimises the delay in sending the message in this case. This particular concern is addressed by reducing the httpOpenTimeout (which defaults to 4s) rather than the httpRequestTimeout.
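The distinction between the two timeouts can be sketched as a retry policy that retries only when the connection fails to open, and lets a slow request run to the full request timeout. This is a sketch under stated assumptions, not an Ably SDK API: the exception classes, the `send` callable and its parameters are all hypothetical, and the 1s open timeout is only an example of a reduced value.

```python
from typing import Any, Callable


class OpenTimeout(Exception):
    """Hypothetical: the connection could not be established in time."""


class RequestTimeout(Exception):
    """Hypothetical: the connection opened, but no response arrived in time."""


def publish_with_retry(
    send: Callable[[Any, float, float], Any],
    message: Any,
    open_timeout_s: float = 1.0,      # reduced, so connection failures surface quickly
    request_timeout_s: float = 10.0,  # left at the default for the whole request
    max_attempts: int = 3,
) -> Any:
    """Retry only when the connection fails to open; never on a slow request."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message, open_timeout_s, request_timeout_s)
        except OpenTimeout:
            # The request never reached the service, so retrying cannot duplicate work.
            if attempt == max_attempts:
                raise
        # RequestTimeout is deliberately not caught: the publish may still
        # succeed, so a retry would only add load.
```

Here `send` stands for whatever HTTP call the publisher makes; it is expected to raise `OpenTimeout` when the connection cannot be established and `RequestTimeout` when the response is late.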