Firstly, we're sorry that you're experiencing disruption. If there is an incident ongoing, please have a look at the current incidents for information. If there is no current incident listed there, but you believe that there is disruption, then please contact support or open a support ticket. We're committed to providing the best possible service for all of our customers, and we need to know if you're having issues. We do provide an uptime guarantee as part of our Service Level Agreement, so if you have a paid account, please let us know what you're experiencing so we can determine whether you're eligible for service credit. Again, we're sorry if disruption is having an impact on you, and we will do what we can to remedy that.
The question now is why we're not indicating downtime on the status site. The most likely reason is that the service is still generally available, but an incident is causing local disruption - either in a specific region, or only for certain accounts, or perhaps only affecting certain operations. The reality is that disruption rarely has a binary outcome, with the service being either up or down - instead, customers experience higher than normal error rates or latencies, or perhaps failures for specific channels. The uptime indication on the status site reflects whether or not the service is generally available. On those few occasions when service disruption has a significant adverse impact, we typically see that operations continue for most customers with little or no impact, whereas some customers are impacted significantly.
We would like to have a more objective way to define when there is degraded service, both generally and for a specific app or account. We are looking at how we can do this, but it is genuinely difficult to come up with an objective measure that is representative of the actual disruption or impact experienced by a given customer. In the future, for example, we plan to gather error statistics for apps and accounts, with a count for each error code, and we could use these stats as part of a more objective definition of when there is degraded service, for one customer or generally.
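To illustrate the kind of measure described above, here is a minimal sketch of how per-error-code counts might feed a "degraded" check for a single app. It is not how Ably measures service health: the function names, the error codes, the baseline rate, and the thresholds are all hypothetical, chosen only to show the idea.

```python
from collections import Counter

def error_rate(error_counts, total_requests):
    """Fraction of requests that failed, given a count per error code."""
    if total_requests <= 0:
        return 0.0
    return sum(error_counts.values()) / total_requests

def is_degraded(error_counts, total_requests, baseline_rate,
                factor=3.0, min_requests=100):
    """Flag degraded service when the error rate exceeds a multiple of baseline.

    All parameters are illustrative assumptions: `baseline_rate` is the app's
    normal error rate, `factor` is how far above baseline counts as degraded,
    and `min_requests` avoids drawing conclusions from very little traffic.
    """
    if total_requests < min_requests:
        return False
    return error_rate(error_counts, total_requests) > baseline_rate * factor

# Hypothetical error codes and counts for one app over some time window.
app_errors = Counter({50000: 40, 80000: 25})
degraded = is_degraded(app_errors, total_requests=1000, baseline_rate=0.01)
```

Even a simple check like this shows why the problem is hard: the right baseline, multiplier, and time window differ between apps, and a single error rate says little about how much an individual customer was actually affected.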
Note also that it is the nature of internet-scale operations that specific regions do experience issues - either because of problems in our own operations, or those of the providers we depend on, such as Amazon AWS. We address this by making the service available in multiple regions globally, and by ensuring that our libraries will access the Ably service in a different region if they experience errors in their closest region. Sometimes we will explicitly route traffic away from failing regions in order to minimise the impact until the situation is resolved. This kind of situation is unavoidable on occasion - we would not indicate this as the service being unavailable unless we continued to route traffic to the failing region.
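The general pattern of retrying in another region can be sketched as follows. This is an illustration only and not the implementation used by the Ably client libraries: the region names, the `send` callback, and the `AllRegionsFailed` exception are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

class AllRegionsFailed(Exception):
    """Raised when every region in the list has been tried without success."""

def request_with_fallback(regions: Sequence[str],
                          send: Callable[[str], T]) -> T:
    """Try each region in order, returning the first successful response."""
    failures: List[Tuple[str, Exception]] = []
    for region in regions:
        try:
            return send(region)
        except ConnectionError as exc:
            # Record the failure and move on to the next region.
            failures.append((region, exc))
    raise AllRegionsFailed(failures)

def fake_send(region: str) -> str:
    """Stand-in for a real request; the closest region is failing here."""
    if region == "region-a":
        raise ConnectionError("region-a unavailable")
    return f"ok from {region}"

# The closest region is listed first, followed by the fallback.
result = request_with_fallback(["region-a", "region-b"], fake_send)
```

From the caller's point of view the request succeeds, although it may take longer than usual. That is one reason a regional problem can show up as higher latency rather than as an outage.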
So, if you are experiencing disruption, but we are not indicating downtime on the status site, then:
- We're sorry - we do aim to provide the best possible service at all times;
- Check the latest incident updates for more information;
- Let us know. We do stand by our SLA commitments (and for clarity, nothing in this article modifies or takes away from the effectiveness of the SLA, and the SLA controls in the event of any conflict or inconsistency between this posting and the SLA), and we will always try to be reasonable in remedying any issues that you experience.