Application uptime is typically defined as the percentage of time an application has been available and performing within expectations, i.e. with response times below predefined thresholds. Application downtime may be caused by factors such as:
- hardware failures
- software bugs
- application load exceeding system capacity
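To make uptime percentages concrete, it helps to translate them into the downtime budget they allow; a small back-of-the-envelope sketch:

```python
# Convert an uptime percentage into the downtime it allows per year.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(uptime_pct):
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR

print(round(downtime_hours_per_year(99.9), 2))   # "three nines": about 8.76 hours/year
print(round(downtime_hours_per_year(99.99), 2))  # "four nines": under an hour/year
```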
Managing uptime involves preventing downtime events from occurring and, when they do occur, recovering from them in a timely manner. Methods for increasing uptime include:
- infrastructure redundancy, duplicating (at a minimum) all infrastructure components
- load balancing of components which can handle a subset of the application load, e.g. application servers
- automatic or manual failover of components which are single points of failure, such as a master SQL database server
- multiple, geographically distributed data centers to handle disaster scenarios
- frequent backups and ability to restore backups rapidly
- ability to replace failed components quickly
- application architected for scalability
- ability to add hardware when needed
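Load balancing in its simplest form just rotates requests across the redundant servers; a minimal round-robin sketch (the server names are hypothetical):

```python
# Simplest form of load balancing: rotate requests across redundant servers.
import itertools

servers = ["app-1a", "app-1b"]          # hypothetical servers in two zones
round_robin = itertools.cycle(servers)

assignments = [next(round_robin) for _ in range(4)]
print(assignments)  # alternates: ['app-1a', 'app-1b', 'app-1a', 'app-1b']
```

Round-robin DNS, mentioned later in this post, applies the same idea at the DNS layer by returning multiple A records in rotating order.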
Many of these are very easy to accomplish in cloud deployments, in many cases easier and less expensive than with alternatives such as managed hosting or building your own data center. When running an application on EC2 you can easily:
- implement server redundancy and load balancing
- rapidly replace or add more servers – manually or automatically
- configure infrastructure to be distributed across multiple zones and regions
- frequently back up data to S3
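Frequent S3 backups usually amount to writing a database dump to a timestamped object key, so successive backups never overwrite each other. A sketch (the bucket name, key prefix, and the boto3 upload call are assumptions for illustration, not part of the setup described here):

```python
# Sketch: frequent database backups to S3 under timestamped keys.
import datetime

def backup_key(prefix, when):
    # Timestamped keys ensure frequent backups never overwrite each other.
    return f"{prefix}/db-backup-{when:%Y%m%d-%H%M%S}.sql.gz"

# Upload step (requires AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").upload_file(
#     "/tmp/db.sql.gz", "my-backup-bucket",
#     backup_key("prod", datetime.datetime.utcnow()))
```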
Cloud deployments also have some challenges:
- cloud servers are more fragile than dedicated servers
- when they go down, any data stored locally is gone
- many failover techniques are not available. For example, a common way of managing MySQL database uptime is to set up a cluster using DRBD and Heartbeat; you simply cannot implement this on EC2.
You can architect solutions around the first two challenges. Limited automatic failover options, however, can be a problem for applications with very high uptime requirements.
We are working with RightScale and following a fairly standard web architecture outlined in the diagram below.
All servers are duplicated across two availability zones. Application servers automatically scale up and down depending on the load. The database is replicated and backed up frequently.
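The scale-up/scale-down decision itself is simple sizing logic; in practice it is driven by monitoring alerts rather than computed by hand. A sketch (the capacity and headroom numbers are hypothetical):

```python
# Sketch: how many application servers to run for a given load.
import math

def desired_servers(requests_per_sec, capacity_per_server,
                    min_servers=2, headroom=1.2):
    # Keep some headroom above measured load, but never drop below the
    # redundancy minimum (one server per availability zone).
    needed = math.ceil(requests_per_sec * headroom / capacity_per_server)
    return max(min_servers, needed)

print(desired_servers(1000, 300))  # 4 servers for 1000 req/s at 300 req/s each
print(desired_servers(100, 300))   # light load still keeps the 2-server minimum
```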
The impact of various failure scenarios, with this architecture in place, is outlined below:

| Failed component | Impact | Recovery | Downtime |
| --- | --- | --- | --- |
| Application server | None | Launch new server. | None |
| Memcached server | Performance degradation | Launch new server. | None |
| Database slave | None | Launch new server. | None |
| Database master | Application down | Promote slave to master. Launch new database slave server. | < 5 min |
| Load balancer | Application appears down to 50% of traffic (we use round-robin DNS load balancing). | Launch new server. | < 15 min |
| Entire zone | Equivalent to losing a load balancer and a database server. | Promote slave to master (if the master is down). Launch new servers. | < 15 min |
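The recovery column above lends itself to automation; a minimal sketch of a monitor's decision logic (component names and the TCP liveness probe are illustrative assumptions):

```python
# Sketch: map a detected failure to the recovery action from the table above.
import socket

def is_up(host, port, timeout=2.0):
    """Cheap liveness probe: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def recovery_action(component):
    # Only a master database failure needs a promotion step first;
    # every other failed component is replaced with a fresh server.
    if component == "database-master":
        return "promote slave to master, then launch new slave"
    return "launch new server"

print(recovery_action("application-server"))  # launch new server
print(recovery_action("database-master"))     # promote slave to master, then launch new slave
```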
What this architecture does not handle is a simultaneous failure of both zones or the entire region. It could be improved by:
- running servers in more than 2 zones
- setting up servers in an alternative region and data replication to that region
- configuring a disaster recovery site on an alternative cloud
You can find descriptions of even more robust architectures in posts such as Lessons Netflix Learned from the AWS Outage and Why Twilio Wasn’t Affected by Today’s AWS Issues. The tradeoff is better uptime at a higher engineering cost.
Architecting for uptime is not about maximizing uptime. It is about finding the right balance between the business cost of downtime, the acceptable level of risk, and the engineering cost of reducing downtime risks. For some applications with very high uptime requirements, the cloud may not be the right solution. For many applications, however, you can meet uptime requirements on EC2 with a level of effort, timeline, and cost that is hard to match with the alternatives.