Application uptime is typically defined as the percentage of time an application has been available and performing within expectations, i.e. with response times below predefined thresholds. Application downtime may be caused by factors such as:

  • hardware failures
  • software bugs
  • application load exceeding system capacity
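The definition of uptime above can be made concrete with a little arithmetic. The following is a minimal sketch, not a monitoring tool: the function names `uptime_percent` and `downtime_budget` are hypothetical, and it assumes you already have a list of outage intervals (periods when the application was unavailable or slower than the agreed threshold).

```python
from datetime import datetime, timedelta


def uptime_percent(period_start, period_end, outages):
    """Return the percentage of the period during which the application was up.

    `outages` is a list of (start, end) datetime pairs. Overlapping
    intervals are merged so that no downtime is counted twice.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    down = 0.0
    counted_until = period_start
    for start, end in sorted(outages):
        # Clip to the measurement period and skip time already counted.
        start = max(start, counted_until)
        end = min(end, period_end)
        if end > start:
            down += (end - start).total_seconds()
            counted_until = end
    return 100.0 * (total - down) / total


def downtime_budget(target_percent, period):
    """Return the downtime (a timedelta) allowed in `period` at an uptime target."""
    return period * (1 - target_percent / 100.0)


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    month = timedelta(days=30)
    outages = [(start + timedelta(hours=1), start + timedelta(hours=3))]
    print(round(uptime_percent(start, start + month, outages), 3))
    print(downtime_budget(99.9, month))
```

Under these assumptions, a 99.9% target over a 30-day month leaves roughly 43 minutes of allowed downtime, which is a useful number to keep in mind when reading recovery times later on.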

Managing uptime involves preventing downtime events from occurring and, if they do occur, recovering from them in a timely manner. Methods for increasing uptime include:

  • infrastructure redundancy, duplicating (at a minimum) all infrastructure components
  • load balancing of components which can handle a subset of the application load, e.g. application servers
  • automatic or manual failover of components which are single points of failure, such as a master SQL database server
  • multiple, geographically distributed data centers to handle disaster scenarios
  • frequent backups and ability to restore backups rapidly
  • ability to replace failed components quickly
  • application architected for scalability
  • ability to add hardware when needed
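To see why redundancy appears first in the list above, it helps to work through the standard availability arithmetic. This is a simplified sketch: the function names are hypothetical, the 99% figure is an illustrative number rather than a measurement, and the model assumes component failures are independent, which is rarely true within a single data center.

```python
def parallel_availability(component_availability, copies):
    """Availability of a group of redundant copies where any one copy suffices.

    Assumes independent failures: the group is down only when every copy is down.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


def serial_availability(availabilities):
    """Availability of a chain of tiers that must all be up at the same time."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    single = 0.99  # illustrative value only
    # Two tiers (e.g. application and database), each a single server.
    print(serial_availability([single, single]))
    # The same two tiers with every server duplicated.
    duplicated = parallel_availability(single, 2)
    print(serial_availability([duplicated, duplicated]))
```

The chain of single servers is less available than either server alone, while duplicating each tier pushes the result well above it. That is the reasoning behind duplicating every component and removing single points of failure.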

Many of the methods listed above are easy to accomplish in cloud deployments, in many cases easier and less expensive than with alternatives such as managed hosting or building your own data center. When running an application on EC2 you can easily:

  • implement server redundancy and load balancing
  • rapidly replace or add more servers – manually or automatically
  • configure infrastructure to be distributed across multiple zones and regions
  • frequently back up data to S3

Cloud deployments also have some challenges:

  1. cloud servers are more fragile than dedicated servers
  2. when they go down, any data stored locally is gone
  3. many failover techniques are not available. For example, one common way of managing MySQL database uptime is to set up a cluster using DRBD and Heartbeat. You simply cannot implement this on EC2.

You can architect solutions around challenges 1 and 2 above. The limited automatic failover options (challenge 3), however, can create a problem for applications with very high uptime requirements.

Our Solution

We are working with RightScale and following a fairly standard web architecture outlined in the diagram below.

[Diagram: ec2_architecture]

All servers are duplicated across two availability zones. Application servers are set up to automatically scale up and down depending on the load. The database is replicated and backed up frequently.
The impact of the various failure scenarios, with this architecture in place, is outlined below:

| Failure | Impact | Recovery Action | Downtime |
|---|---|---|---|
| Application server | None | Launch new server. | None |
| Memcached server | Performance degradation | Launch new server. | None |
| Database slave | None | Launch new server. | None |
| Database master | Application down | Promote slave to master. Launch new database slave server. | < 5 min |
| Load balancer | We are using round-robin DNS load balancing. Application will appear down to 50% of the traffic. | Launch new server. | < 15 min |
| Entire zone | This is equivalent to losing a load balancer and a database server. | Promote slave to master (if the master is down). Launch new servers. | < 15 min |
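The database and load balancer rows above can be illustrated with a small in-memory model. This is only a sketch of the recovery logic, not the RightScale tooling or any real API: the class `DatabaseTier`, the function `traffic_affected`, and the server names are all hypothetical, and "launching" a server here just generates a new name.

```python
from itertools import count


class DatabaseTier:
    """Toy model of a replicated database: one master plus zero or more slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._ids = count(1)

    def launch_slave(self):
        # Stand-in for launching a real replacement server.
        name = f"db-new-{next(self._ids)}"
        self.slaves.append(name)
        return name

    def handle_failure(self, server):
        """Apply the recovery actions from the table; return the steps taken."""
        steps = []
        if server == self.master:
            if not self.slaves:
                raise RuntimeError("no slave available to promote")
            self.master = self.slaves.pop(0)
            steps.append(f"promote {self.master} to master")
        elif server in self.slaves:
            self.slaves.remove(server)
        else:
            raise ValueError(f"unknown server: {server}")
        steps.append(f"launch {self.launch_slave()} as slave")
        return steps


def traffic_affected(total_balancers, failed_balancers):
    """Fraction of traffic hitting a dead load balancer under round-robin DNS.

    Assumes DNS spreads clients evenly and clients do not retry another address.
    """
    return failed_balancers / total_balancers


if __name__ == "__main__":
    tier = DatabaseTier("db-zone-a", ["db-zone-b"])
    print(tier.handle_failure("db-zone-a"))
    print(traffic_affected(total_balancers=2, failed_balancers=1))
```

With two load balancers and one down, the model gives 0.5, matching the 50% figure in the table. The master failure path shows why that row has non-zero downtime: the application is unavailable until the promotion step completes.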

What this architecture does not handle is a simultaneous failure of both zones or the entire region. It could be improved by:

  1. running servers in more than 2 zones
  2. setting up servers in an alternative region and data replication to that region
  3. configuring a disaster recovery site on an alternative cloud

You can find descriptions of even more robust architectures in posts such as "Lessons Netflix Learned from the AWS Outage" and "Why Twilio Wasn’t Affected by Today’s AWS Issues". The tradeoff is better uptime at a higher engineering cost.

Summary

Architecting for uptime is not about maximizing uptime. It is about finding the right balance between the business costs of downtime, an acceptable level of risk, and the engineering costs of reducing downtime risks. For some applications with very high uptime requirements, the cloud may not be the right solution. For many applications, you can meet uptime requirements on EC2 with a level of effort, timelines and costs that are hard to match with the alternatives.
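The balance described above can be framed as a simple cost comparison. The sketch below is deliberately crude: the function names are hypothetical, every figure is invented for illustration, and it ignores risk tolerance entirely by looking only at expected cost.

```python
def expected_annual_cost(engineering_cost, downtime_hours, cost_per_downtime_hour):
    """Yearly cost: what you spend on uptime plus what the remaining downtime costs."""
    return engineering_cost + downtime_hours * cost_per_downtime_hour


def cheapest_option(options, cost_per_downtime_hour):
    """Return the name of the option with the lowest expected annual cost.

    `options` maps a name to (engineering_cost, expected_downtime_hours).
    """
    return min(
        options,
        key=lambda name: expected_annual_cost(*options[name], cost_per_downtime_hour),
    )


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    options = {
        "single zone": (10_000, 40),
        "two zones": (30_000, 8),
        "multi-region": (120_000, 1),
    }
    print(cheapest_option(options, cost_per_downtime_hour=500))
    print(cheapest_option(options, cost_per_downtime_hour=20_000))
```

With these made-up numbers, the cheapest architecture changes as the hourly cost of downtime rises. The right level of redundancy depends on what an hour of downtime is worth to the business, not on uptime alone.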
