Routing around network and DNS issues

When a client connects to Ably, it either connects to our realtime endpoint realtime.ably.io or communicates over REST using rest.ably.io. Clients are automatically routed to the closest data centre by our latency-based DNS routing. When a client sends a request to our DNS servers, the DNS servers determine the location of the client and respond with one or more IP addresses from the closest data centre (closest meaning the one with the lowest latency).

 

If a data centre is unhealthy, Ably automatically stops routing traffic to that data centre at the DNS level and directs DNS requests to the next closest healthy data centre. With over 16 data centres and 175+ edge acceleration points of presence globally, and a DNS TTL of 60 seconds, traffic is rerouted away from unhealthy data centres very quickly.
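The routing behaviour described above can be sketched as a simple selection rule: ignore unhealthy data centres, then answer with the addresses of the healthy one that has the lowest latency to the client. This is only an illustrative sketch, not Ably's DNS implementation. The `DataCentre` type, the `resolve` function, the data centre names, and the IP addresses (taken from the reserved documentation ranges) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DataCentre:
    """Hypothetical record of one data centre as seen from a client's location."""
    name: str
    latency_ms: float            # assumed measured latency to the client
    healthy: bool                # assumed result of health monitoring
    addresses: Tuple[str, ...]   # IPs returned in the DNS answer


def resolve(data_centres: Iterable[DataCentre]) -> Tuple[str, ...]:
    """Return the addresses of the lowest-latency healthy data centre."""
    healthy = [dc for dc in data_centres if dc.healthy]
    if not healthy:
        raise LookupError("no healthy data centre available")
    return min(healthy, key=lambda dc: dc.latency_ms).addresses


if __name__ == "__main__":
    example = [
        DataCentre("nearest", 12.0, False, ("192.0.2.1",)),   # unhealthy, skipped
        DataCentre("next-closest", 40.0, True, ("198.51.100.1", "198.51.100.2")),
        DataCentre("distant", 150.0, True, ("203.0.113.1",)),
    ]
    print(resolve(example))
```

In this example the nearest data centre is marked unhealthy, so the answer contains the addresses of the next closest healthy one instead.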

 

However, relying solely on DNS to fix connectivity issues is not enough for us to provide a 99.999% uptime guarantee, for three reasons:

 

  • The DNS TTL is 60 seconds, and it may take 30 seconds before our health monitoring systems confirm that a region is unhealthy. A client that is being routed to an unhealthy data centre could therefore be disconnected for around two minutes.
  • If the client is suffering from a network routing issue, potentially caused by its own ISP, DNS cannot route around the problem. Only clients on that ISP are affected, and the data centre itself is still healthy, so DNS has no reason to redirect traffic.
  • Should the entire domain fail, which could feasibly be caused by a root level domain issue outside of our control, no DNS responses will be delivered to any clients querying *.ably.io.
 

Client library fallbacks and secondary domain endpoints

 

All of our Ably client libraries have a built-in fallback mechanism that is able to route around DNS TTL delays, network routing or partitioning issues, and even complete ably.io domain failures. If a client is connected to the internet but is unable to connect to Ably, or send a REST request to Ably, on the default endpoint, the client will randomly pick another data centre to connect to using a secondary backup domain. Ably operates a completely segregated secondary domain, *.ably-realtime.com, that is designed to isolate any and all DNS failures on the primary ably.io domain.
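The fallback behaviour described above can be sketched roughly as follows: try the default endpoint first, and only if that fails while the internet is still reachable, try the fallback hosts in random order. This is a minimal sketch and not the code used in any Ably client library. The function `connect_with_fallback`, the `try_connect` and `has_internet` callbacks, and the host names in `FALLBACK_HOSTS` are hypothetical placeholders; only the domains realtime.ably.io and *.ably-realtime.com come from the text.

```python
import random
from typing import Callable, Optional, Sequence

PRIMARY_HOST = "realtime.ably.io"

# Placeholder host names on the secondary domain, invented for this sketch.
FALLBACK_HOSTS = [
    "fallback-1.ably-realtime.com",
    "fallback-2.ably-realtime.com",
    "fallback-3.ably-realtime.com",
]


def connect_with_fallback(
    try_connect: Callable[[str], bool],
    has_internet: Callable[[], bool],
    primary: str = PRIMARY_HOST,
    fallbacks: Sequence[str] = FALLBACK_HOSTS,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the host that accepted the connection, or None if none did."""
    rng = rng or random.Random()

    # 1. Always try the default endpoint first.
    if try_connect(primary):
        return primary

    # 2. Only fall back when the client itself still has connectivity.
    if not has_internet():
        return None

    # 3. Try the fallback hosts on the secondary domain in random order.
    shuffled = list(fallbacks)
    rng.shuffle(shuffled)
    for host in shuffled:
        if try_connect(host):
            return host
    return None
```

Randomising the order is what spreads fallback traffic across data centres rather than sending every affected client to the same one. The connectivity check keeps a client that is simply offline from cycling through hosts that cannot help.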

 
Whilst a client using the fallback mechanism may connect to a data centre that is further away, we believe that the guarantee of being able to connect to Ably far outweighs the downside of a potential latency increase of at most 150ms.
 

Further reading