After attempting a practice exam as part of my own studying, I’ve been searching the internet for sample questions. I’m going to break down what I believe to be the correct answer and the rationale for how I got to it. I’m happy for people to comment and provide their input as well, so feel free to let me know if you disagree.
For a 3-Tier, Customer facing, inclement weather site utilising a MySQL database running in a Region which has 2 AZs (Availability Zones), which architecture provides fault tolerance within the Region for the application that minimally requires 6 web tier servers and 6 application tier servers running in the web and application tiers and one MySQL database? (Choose 1)
a. A web tier deployed in 1 AZ with 6 EC2 (Elastic Compute Cloud) instances inside an Auto Scaling Group behind an ELB (Elastic Load Balancer), and an application tier deployed in the same AZ with 6 EC2 instances inside an Auto Scaling Group behind an ELB, and a Multi-AZ RDS (Relational Database Service) deployment, with 6 stopped web tier EC2 instances and 6 stopped application tier EC2 instances all in the other AZ ready to be started if any of the running instances in the first AZ fails.
b. A web tier deployed in 2 AZs with 6 EC2 (Elastic Compute Cloud) instances in each AZ inside an Auto Scaling Group behind an ELB (Elastic Load Balancer), and an application tier deployed in 2 AZs with 6 EC2 instances in each AZ inside an Auto Scaling Group behind an ELB, and a Multi-AZ RDS (Relational Database Service) deployment.
c. A web tier deployed in 2 AZs with 3 EC2 (Elastic Compute Cloud) instances in each AZ inside an Auto Scaling Group behind an ELB (Elastic Load Balancer), and an application tier deployed in 2 AZs with 3 EC2 instances in each AZ inside an Auto Scaling Group behind an ELB, and a Multi-AZ RDS (Relational Database Service) deployment.
d. A web tier deployed in 2 AZs with 3 EC2 (Elastic Compute Cloud) instances in each AZ inside an Auto Scaling Group behind an ELB (Elastic Load Balancer), and an application tier deployed in 2 AZs with 6 EC2 instances in each AZ inside an Auto Scaling Group behind an ELB, and one RDS (Relational Database Service) instance deployed with read replicas in the other AZ.
The way I personally approach a question is to first rule out the obviously incorrect answers, which narrows the set of options.
Looking at “Answer D”, it says to deploy an RDS Read Replica for the MySQL database. Given that the question is looking for a fault-tolerant architecture, a Read Replica is the wrong option. Read Replicas should be used for scaling out database reads. A Read Replica also uses asynchronous replication, as opposed to the synchronous replication that a Multi-AZ deployment provides. It’s important to understand the difference between synchronous and asynchronous replication, so I’d suggest reading through the AWS RDS Frequently Asked Questions. The following is an extract from the FAQ:
Multi-AZ (Synchronous Replication)
“If you are looking to use replication to increase database availability while protecting your latest database updates against unplanned outages, consider running your DB instance as a Multi-AZ deployment. When you create or modify your DB instance to run as a Multi-AZ deployment, Amazon RDS will automatically provision and manage a “standby” replica in a different Availability Zone (independent infrastructure in a physically separate location). In the event of planned database maintenance, DB instance failure, or an Availability Zone failure, Amazon RDS will automatically failover to the standby so that database operations can resume quickly without administrative intervention. Multi-AZ deployments utilize synchronous replication, making database writes concurrently on both the primary and standby so that the standby will be up-to-date in the event a failover occurs. While our technological implementation for Multi-AZ DB Instances maximizes data durability in failure scenarios, it precludes the standby from being accessed directly or used for read operations. The fault tolerance offered by Multi-AZ deployments make them a natural fit for production environments.”
Read Replicas (Asynchronous Replication)
“Read Replicas are supported by Amazon Aurora, Amazon RDS for MySQL, MariaDB and PostgreSQL. Unlike Multi-AZ deployments, Read Replicas for these engines use each’s built-in replication technology and are subject to its strengths and limitations. In particular, updates are applied to your Read Replica(s) after they occur on the source DB instance (“asynchronous” replication), and replication lag can vary significantly. This means recent database updates made to a standard (non Multi-AZ) source DB instance may not be present on associated Read Replicas in the event of an unplanned outage on the source DB instance. As such, Read Replicas do not offer the same data durability benefits as Multi-AZ deployments. While Read Replicas can provide some read availability benefits, they are not designed to improve write availability.”
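To make the difference between the two extracts concrete, here is a minimal toy sketch in Python. It is not a model of how RDS works internally: the `Database` class, its methods and the `demo` helper are all hypothetical names invented for illustration. It only shows why a write can be lost when replication is asynchronous and the primary fails before the replica has caught up.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """A toy primary/replica pair (hypothetical; not a model of RDS internals)."""
    synchronous: bool
    primary: list = field(default_factory=list)
    replica: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # writes not yet shipped to the replica

    def write(self, row):
        self.primary.append(row)
        if self.synchronous:
            # Multi-AZ style: the standby receives the write as part of the write itself.
            self.replica.append(row)
        else:
            # Read Replica style: the write is queued and applied some time later.
            self.pending.append(row)

    def apply_replication(self):
        """Ship queued writes to the replica, i.e. the replication lag catching up."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def lost_on_primary_failure(self):
        """Rows that exist only on the primary if it were to fail right now."""
        return [row for row in self.primary if row not in self.replica]


def demo(synchronous):
    db = Database(synchronous=synchronous)
    db.write("forecast-1")
    db.apply_replication()
    db.write("forecast-2")  # assume the primary fails before this write is replicated
    return db.lost_on_primary_failure()
```

Under these assumptions, `demo(True)` reports no lost rows, while `demo(False)` reports that the most recent write exists only on the failed primary.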
Now that we have removed one of the possible answers, let’s try to remove another. For further reading I would recommend a few of the AWS Whitepapers, specifically:
- Architecting for the Cloud – AWS Best Practices
- Using Amazon Web Services for Disaster Recovery
- Building Fault Tolerant Applications on AWS
The Best Practices Whitepaper covers the key design principles which this question is clearly targeting, and in my opinion it’s focusing on the “Optimize for Cost” and the “Removing Single Points of Failure” principles. On single points of failure, the whitepaper says:
“Single points of failure can be removed by introducing redundancy, which is having multiple resources for the same task. Redundancy can be implemented in either standby or active mode.
In standby redundancy when a resource fails, functionality is recovered on a secondary resource using a process called failover. The failover will typically require some time before it completes, and during that period the resource remains unavailable. The secondary resource can either be launched automatically only when needed (to reduce cost), or it can be already running idle (to accelerate failover and minimize disruption). Standby redundancy is often used for stateful components such as relational databases.
In active redundancy, requests are distributed to multiple redundant compute resources, and when one of them fails, the rest can simply absorb a larger share of the workload. Compared to standby redundancy, it can achieve better utilization and affect a smaller population when there is a failure.”
The question is calling for fault tolerance, which is defined as follows: “Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown.”
I believe we can rule out “Answer A”, since in the event of a failure of any single component the service would need some sort of manual intervention to restore the minimum requirement of 6 Web Servers and 6 Application Servers. The Multi-AZ database and the ELB would meet the requirements, but because the EC2 instances in the secondary Availability Zone are stopped, they would need to be started manually rather than by automation.
As I alluded to a little earlier in this post, there is another Whitepaper that I haven’t focused on yet, which relates to disaster recovery. It covers 4 example scenarios for disaster recovery:
- Backup and Restore
- Pilot Light
- Warm Standby
- Multi-Site
I’m not going to go into these example scenarios; if you’re studying for the exam yourself, or have been working in solution design or architecture for a while, you’ll already understand the concepts.
Of the remaining answers, “Answer B” and “Answer C” both relate to the Multi-Site scenario. As you move further towards the Multi-Site scenario, you typically find that the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements become more stringent and the costs increase to meet those requirements.
Reading “Answer B”, the option is to implement a total of 24 EC2 instances across 2 Availability Zones (12 as Web Servers and 12 as Application Servers). All of these servers are also members of Auto Scaling Groups, so they can scale to meet demand if required. In my opinion this appears to be way over-sized and not in keeping with the “Optimize for Cost” principle. However, I don’t think we can rule it out, given that in the event of a failure of any single component, be it the Database, Application Tier, Web Tier or ELB, or even the loss of an entire AZ, the service can still function with the required minimum of 6 Web Servers and 6 Application Servers.
“Answer C”, on the other hand, is to implement a total of 12 EC2 instances across 2 Availability Zones (6 as Web Servers and 6 as Application Servers). All of these servers are also members of Auto Scaling Groups, so they can scale to meet demand if required. I believe we can rule out “Answer C”, since in the event of a failure, such as the loss of one AZ leaving only 3 Web Servers and 3 Application Servers, the service would not meet the minimum requirement of 6 Web Servers and 6 Application Servers, although if there was a failure at the Web or Application Tier the Auto Scaling policy would soon re-provision the required resources.
Therefore in my opinion the correct option is “Answer B”.
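The instance arithmetic behind ruling out each answer can be sketched in a few lines of Python. This is only my own illustration, under stated assumptions: the failure being considered is the loss of one whole AZ, stopped instances count as zero running capacity (they need someone to start them), and the names `ANSWERS`, `worst_case_survivors`, `is_fault_tolerant` and `evaluate` are hypothetical.

```python
REQUIRED = 6  # minimum running instances per tier, taken from the question

# Running instances per AZ for each tier, and whether the database is Multi-AZ.
# Assumption: the stopped instances in Answer A count as 0 running instances.
ANSWERS = {
    "A": {"web": (6, 0), "app": (6, 0), "multi_az_db": True},
    "B": {"web": (6, 6), "app": (6, 6), "multi_az_db": True},
    "C": {"web": (3, 3), "app": (3, 3), "multi_az_db": True},
    "D": {"web": (3, 3), "app": (6, 6), "multi_az_db": False},
}


def worst_case_survivors(per_az):
    """Running instances left after losing the AZ that holds the most of them."""
    return sum(per_az) - max(per_az)


def is_fault_tolerant(answer):
    """True if both tiers keep the required minimum and the database is Multi-AZ."""
    tiers_ok = all(
        worst_case_survivors(answer[tier]) >= REQUIRED for tier in ("web", "app")
    )
    return tiers_ok and answer["multi_az_db"]


def evaluate():
    return {name: is_fault_tolerant(answer) for name, answer in ANSWERS.items()}
```

Calling `evaluate()` marks only “Answer B” as meeting the requirement: Answer A is left with no running instances, Answer C with 3 per tier, and Answer D with 3 Web Servers and no Multi-AZ database.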