The short version


Unlike other pub/sub platforms, Ably is a stateful system that provides messaging with a high quality of service. As such, there is a "cost" to creating a new channel, so the first publish on a channel with no subscribers can take longer than subsequent publishes.


However, this publish request latency has no material impact on performance for subscribers. If a client is subscribed to a channel, that channel is already provisioned within the Ably platform, so a publish to it completes with very low latency. If, however, the channel does not yet exist because no one is subscribed for messages, a publish may take longer while we provision the channel within Ably. This provisioning time has no meaningful impact, because no one is subscribed at that moment; high latency does not matter if no one is receiving the messages.


Note that our stateful design is what allows us to provide a higher quality of service than other platforms. We only consider a message published once it has been stored in at least two data centers. Other pub/sub systems typically ACK a message as soon as it is received, yet at that point delivery is not guaranteed, as the message is usually stored only on the server that processed the request. That low request latency is therefore a red herring. The time to ACK a publish operation may sometimes be longer with Ably, but this makes publishes more reliable, and importantly it has no impact on the time it takes for a message to be delivered, which is what matters most.


The long version


When a frontend receives a request to publish a message via the REST API, the following steps occur:
  • request credentials are validated, and checked for compatibility with the attempted operation;

  • rate limits are checked for the REST API operation;

  • the request body is decoded and the messages within it undergo simple validity checks (eg payload size, connectionId and clientId validity);

  • the target channel is activated, if it is not already active in that region;

  • the message is forwarded to the core message processing layer;

  • rate limits are checked for the channel;

  • the message is checked for existence, persisted and indexed;

  • successful processing of the message is indicated to the frontend;

  • the HTTP request completes, with success or failure indicated to the client.


We perform all of these steps before completing the publish request so that any of the legitimate failure or rejection conditions can be reflected in the response. In principle, we could complete the publish before all of these steps are complete, but that would mean that legitimate failures would not be indicated to the client.
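As a rough sketch, the sequence above might look like the following. Every helper and field name here is hypothetical; the real implementation is internal to Ably and this only illustrates the ordering of the steps:

```javascript
/* Sketch of a frontend's publish handling. All helper names and limits
   are hypothetical; the real implementation is internal to Ably. */

function validateCredentials(req) {
  // auth and capability checks
  if (!req.apiKey) throw new Error('invalid credentials');
}

function checkRestRateLimit(appId) {
  // would consult the per-app REST API rate limiter
}

function decodeMessages(body) {
  // decode the request body and run simple validity checks
  const messages = JSON.parse(body);
  for (const m of messages) {
    if (JSON.stringify(m.data).length > 65536) throw new Error('payload too large');
  }
  return messages;
}

async function handleRestPublish(req, core) {
  validateCredentials(req);
  checkRestRateLimit(req.appId);
  const messages = decodeMessages(req.body);
  // channel activation: the step that can be slow if the channel is not yet active
  const channel = await core.activateChannel(req.channelName);
  // core processing: channel rate limits, dedupe check, persistence, indexing
  await core.process(channel, messages);
  // only now does the HTTP request complete
  return { statusCode: 201 };
}
```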


Publish latency


Several of these steps can introduce latency - whether from transit latencies (especially channel activation in a multi-region environment), network queueing latencies, or processing queueing latencies.


In our architecture, channel operations after a channel has been activated are designed to be as deterministic as possible. The channel activation step, however, in which channel resources are allocated, is inherently less deterministic and can account for a significant fraction of the overall publish time when a channel is not already active.


We try to keep these latencies as small as possible, but there is inevitably some variance, and all of these sources of latency will grow as system load increases.


There are other causes of increased latency in Ably when the cluster configuration is changing; for example when there is a scaling operation. The cluster is elastic, and will scale on demand, but latencies increase during scaling operations because changes in configuration take time to be consistently agreed across the cluster.


The latency experienced by the publisher also depends on TLS handshake and network latencies in addition to the request latency at Ably.


Impact of short timeouts in Ably


This statistical variation means that some fraction of publish requests will take longer than a short timeout interval (say, less than 1s). In the vast majority of cases, a request that misses the deadline will still succeed, just with higher than average latency. If the client retries the publish, in almost all situations this will neither increase the likelihood of the message being published nor result in subscribers receiving the message more quickly. In fact, when the system is under load and queueing latencies increase, a larger fraction of requests will exceed the timeout, and the resulting retries will simply create more load on the system (as well as on the workers making the requests) and worsen the situation. There is a significant likelihood of a cascade effect, with retries, load and latency all increasing continuously until a limit is hit (eg the request capacity of the worker pool, or a global account request rate limit in Ably).


This load feedback effect would be particularly problematic during scaling operations. Typically we autoscale when load reaches a threshold; if scaling causes temporary latency, and that latency in turn triggers a significant increase in load from retries, the problem compounds.
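If a client must retry at all, spacing retries out with jittered exponential backoff at least avoids the synchronised retry storms described above. A minimal sketch (the function name and parameter defaults are illustrative, not part of any Ably SDK):

```javascript
/* Compute a retry delay with exponential backoff and full jitter.
   Parameters are illustrative defaults, not Ably-specific values. */
function retryDelayMs(attempt, baseMs = 500, capMs = 30000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // 500, 1000, 2000, ... capped
  return Math.random() * exp;                         // full jitter spreads retries out
}
```

The jitter matters as much as the backoff: without it, a population of clients that timed out together will retry together, re-creating the load spike.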


Impact of short timeouts on publisher


The other adverse impact of short timeouts is that, if requests fail for a legitimate reason (eg a rate limit or channel limit), the publisher will never be notified of the problem. For these kinds of error conditions, repeated publish attempts will likely fail for the same reason as the first attempt.


A common concern is the impact on the utilisation of a worker thread pool when many threads sit idle, blocked on HTTP requests with long timeouts. However, that impact is a function of the mean request duration, not the peak duration; when the vast majority of requests complete in a very short time, the effect of occasional long requests on overall threadpool utilisation is minimal.
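To make that concrete, here is a back-of-the-envelope calculation with an invented request mix: if 99% of requests complete in 50ms and 1% take 2s, thread occupancy is governed by the roughly 70ms mean, not the 2s tail.

```javascript
/* Mean request duration dominates threadpool occupancy (Little's law:
   busy threads ~= request rate x mean duration). Numbers are illustrative. */
const fastMs = 50, slowMs = 2000, slowFraction = 0.01;

const meanMs = (1 - slowFraction) * fastMs + slowFraction * slowMs; // 69.5 ms

const requestsPerSecond = 100;
const busyThreads = requestsPerSecond * (meanMs / 1000); // ~7 threads busy on average
```

Even though individual requests can occupy a thread for 2s, at this mix only around 7 threads are busy at any moment for 100 requests/s, so the occasional slow request barely moves the pool's utilisation.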


However, threadpool utilisation is also affected by whatever action is taken as a result of a timeout. If a timeout simply results in the request being retried, the associated workers will not be idle, but neither will they generate any greater productive output - the overall workload increases, producing repeat requests that most likely achieve nothing, because the initial publish request has since succeeded.


Timely detection and correction of publisher connection failures


The one situation in which it is legitimate to retry a request is when the connection to the Ably endpoint fails, or takes a long time to become established. A short retry interval will minimise the delay in sending the message in this case. This can be addressed by configuring ClientOptions.httpOpenTimeout, which enforces a timeout on connection establishment (instead of a single timeout spanning the entire request and response).
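For example, with the JavaScript library the relevant client options might be set as follows. The specific values are illustrative; httpOpenTimeout is the option described above, and httpRequestTimeout bounds the whole request:

```javascript
/* Illustrative values: fail fast on connection establishment, but give
   the request itself a generous deadline, since a slow success beats a
   wasteful retry. */
const options = {
  authUrl: '/example_auth_url',
  httpOpenTimeout: 4000,     // abort if the connection is not open within 4s
  httpRequestTimeout: 15000  // allow the full request/response up to 15s
};
/* const client = new Ably.Rest(options); */
```

This way a dead endpoint is detected quickly and retried, while a request that has successfully reached Ably is given time to complete rather than being abandoned and re-sent.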


What does this all mean?


The upshot of all this is that if you want the very first publish to be low-latency, consider attaching to the channel explicitly, in advance of the first publish, using channel#attach, rather than relying on the publish implicitly attaching the channel. (That function takes a callback, or the language equivalent, so you can be notified and take action once the channel has successfully attached.)


For example, the following code will ensure that the first publish has low latency:


let client = new Ably.Realtime({ authUrl: '/example_auth_url' });
let channel = client.channels.get('foo');

/* Attach and provision the channel globally */
channel.attach(function(err) {
  if (err) { return console.error("Channel attach failed", err); }
  /* Channel is now created, so the publish can happen immediately as the channel is "ready" */
  channel.publish("name", "data", function(err) {
    if (err) { return console.error("Channel publish failed", err); }
    console.log("Publish on an already provisioned channel succeeded quickly");
  });
});


You can also make use of Idempotent Publishing to reduce these issues.


Further reading