Designing Your Cloud With Failure In Mind

Table Of Contents

Identifying Potential Failures
Best Practices To Consider
Failures Should Always Be Assumed

Implementing any cloud development project can be tricky and frustrating, especially when you are pressed for time, stuck in reactive approaches, or working under cost-saving constraints.

However, there are things you can do to build long-term scalability and risk mitigation into your cloud architecture. Rather than applying short-term fixes that hold only until the problem arises again, consider designing your cloud with failure in mind by anticipating worst-case scenarios.

It might sound counterintuitive or obvious. However, it is a critical part of building, maintaining, and improving your cloud environment.

To do this, you need to identify potential failures by asking some basic questions and working from there. This will help improve your cloud environment so it can operate at full scale.

Even better, it can help improve your overall service, design proposals for approval, and lessen operational pain for your team.

Identifying Potential Failures

Some basic failures can be easily identified with the following questions:

Is there a single point of failure?
What types of logs and metrics are in place?
What dependencies and components are being utilized?
What security measures and protocols have been implemented?

Let’s understand these questions a little better.

Is there a single point of failure?

Is there one resource in your cloud whose failure would cause everything else in your operation to fail?

Some examples might be a cloud host, an API or other endpoint, or a database. Let’s break these down.

Cloud hosts

If you have a host that is the main dependency for your service or architecture, you are bound to face issues if it times out, gets throttled, or shuts down completely. All the components and programs running on it will fail with it.

Customers will be affected if they can’t reach your service or use it at full scale. Effects can include slowness and bugs that drag down their workflows, or service disruptions of their own.

APIs and Endpoints

If your cloud relies on an API or endpoint, anything that happens to it can cause major disruptions to your service. If your cloud components retrieve or send data via an API, you may find that data lost, corrupted, or flooding your storage space.

Another issue arises if you rely on APIs to build web pages or dashboards. If an API or endpoint suddenly stops working, you no longer have what you need to present the web page or dashboard that keeps your visitors coming back.

Databases

It is critical to know if any constraints or restrictions apply to the databases you are utilizing. Examples can include space and capacity, data formats and structures, and downstream or upstream processes.

By understanding your databases and how data is received, stored, and transferred you can determine possible failures. By identifying those failures, you can implement tools and resources to prevent issues.

What metrics and logs are in place within my cloud architecture?

While generic logs and metrics might be built into your cloud components, they may not necessarily have enough information to help you understand what is happening.

For instance, if a cloud component or script were to fail, would you be able to identify it? Even better, would you be able to catch it before it affects your customers?

Metadata is great but can be completely useless if you aren’t capturing the right data, or don’t know how to read it. Determine what you want to prevent from happening and design data capture mechanisms that can help.

What dependencies and components are being utilized?

When developing programs, you will find yourself using various packages and libraries. This holds for cloud development too. Most cloud services provide their own software and cloud development kits, and working with those services will require you to use them.

It’s important to keep them up to date, along with any other dependencies your code relies on. Many cloud services will require you to do so, because they stop accepting modules or programs that use outdated versions and carry the vulnerabilities those versions present.
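As a rough illustration of keeping dependencies current, the sketch below compares installed package versions against minimums you choose. The function names and the `minimums` mapping are hypothetical, and the version comparison is deliberately simplified (numeric parts only); it assumes you maintain the list of minimums yourself.

```python
from importlib import metadata
from typing import Optional


def parse_version(text: str) -> tuple:
    """Turn '1.26.4' into (1, 26, 4), keeping only each part's leading digits."""
    numbers = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def is_older(current: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse_version(current), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def find_outdated(minimums: dict, installed: Optional[dict] = None) -> dict:
    """Return {package: (installed, minimum)} for packages below their minimum.

    `installed` is an optional {package: version} mapping (hypothetical, handy
    for tests); when omitted, versions are read from the current environment.
    """
    outdated = {}
    for name, minimum in minimums.items():
        if installed is not None:
            current = installed.get(name)
        else:
            try:
                current = metadata.version(name)
            except metadata.PackageNotFoundError:
                current = None
        # A missing package is reported the same way as an outdated one.
        if current is None or is_older(current, minimum):
            outdated[name] = (current, minimum)
    return outdated
```

A check like this could run in a pipeline step and fail the build, or send a notification, when the returned mapping is not empty.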

What security measures and protocols are in place?

Good, simple security measures are easily and often overlooked in cloud development. However, leaving them out makes your cloud vulnerable to issues and security risks.

Some examples include data breaches and leaks, unauthorized access, and cyber attacks. While obvious, these can be damaging, and they are some of the easiest failures to run into if you don’t plan for them.

Best Practices To Consider

Responding to each of these will differ widely depending on your cloud’s configuration and the tools in use.

However, keep some of these best practices in mind when developing, maintaining, and scaling your cloud architecture and operations. Additionally, aim to automate them to reduce manual processes and make sure they actually happen.

A single point of failure

Implement disaster recovery plans and deploy multiple components that can take over and reroute your operations without downtime.
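One way to picture rerouting is a small failover wrapper that tries each redundant component in order. This is a minimal sketch, not a full disaster recovery design; `call_with_failover`, `primary`, and `secondary` are hypothetical names standing in for whatever redundant resources you run.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # broad on purpose: any failure triggers rerouting
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors!r}")


def primary() -> str:
    # Simulated outage of the main component (hypothetical).
    raise ConnectionError("primary unavailable")


def secondary() -> str:
    # Stand-in for a redundant component in another zone or region.
    return "served from secondary"


result = call_with_failover([primary, secondary])
```

In a real setup the rerouting usually lives in a load balancer or DNS layer rather than application code, but the idea is the same: no single component is the only path.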

Building tests and jobs in your cloud to communicate with APIs and endpoints is a great way to confirm that they’re still working. Setting up percentage threshold monitors and alarms is another great way to detect throttling and other problems within components and resources.

By developing these, you can prevent the failures that are within your control.
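A percentage threshold monitor can be sketched in a few lines. The class below is a hypothetical illustration, not any provider's alarm API: it keeps the most recent check results and reports an alarm once the share of failures reaches a threshold you pick.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the last `window` check results and alarm past a percentage threshold."""

    def __init__(self, window: int = 20, threshold_pct: float = 25.0) -> None:
        self.results = deque(maxlen=window)  # older results drop off automatically
        self.threshold_pct = threshold_pct

    def record(self, ok: bool) -> None:
        """Store the outcome of one endpoint check (True means it succeeded)."""
        self.results.append(ok)

    def failure_pct(self) -> float:
        """Percentage of failed checks within the current window."""
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return 100.0 * failures / len(self.results)

    def in_alarm(self) -> bool:
        """True once the failure percentage reaches the threshold."""
        return self.failure_pct() >= self.threshold_pct


monitor = ErrorRateMonitor(window=4, threshold_pct=50.0)
for outcome in (True, True, False):
    monitor.record(outcome)
```

A scheduled job would call `record` after each probe and page someone, or trigger automation, when `in_alarm` turns true. The window size and threshold here are arbitrary and would need tuning for your traffic.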

Logs and metrics

Build robust logging into your programs with try and except logic and descriptive messages that help you understand what your logs are telling you. Some things worth logging include:

Attempts to send or retrieve data with an API/endpoint;
Bugs in your code caused by a dependency or module update;
Metric and alarm information, stored in case metric visuals aren’t working.
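As a minimal sketch of the try and except logging described above, the function below wraps a data request and logs the attempt, the outcome, and any failure. The `fetch` callable, the logger name, and the endpoint string are assumptions made for illustration; swap in whatever client your service uses.

```python
import logging

logger = logging.getLogger("cloud_job")  # hypothetical logger name


def fetch_with_logging(fetch, endpoint: str):
    """Call `fetch(endpoint)` and log the attempt, the outcome, and any failure."""
    logger.info("requesting data from %s", endpoint)
    try:
        payload = fetch(endpoint)
    except TimeoutError:
        # An expected failure mode: say so plainly instead of dumping a traceback.
        logger.warning("timed out calling %s", endpoint)
        return None
    except Exception:
        # logger.exception records the message plus the full traceback at ERROR level.
        logger.exception("unexpected failure calling %s", endpoint)
        return None
    logger.info("received %d records from %s", len(payload), endpoint)
    return payload
```

Separating the expected failure (a timeout) from the unexpected one keeps the logs readable: warnings can feed a throttling metric, while errors with tracebacks point at bugs, such as a dependency update that changed behavior.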

Implement metrics that tell a story of how your cloud is operating. Metrics can be a quick way to troubleshoot an issue. With the right metrics in place, you can also build processes to understand whether a component is being throttled or heading toward failure.

Dependencies and components

Creating notifications and integration tests for your programs and within your cloud pipelines will assist in ensuring everything is running as expected.

Consider using pre-production environments and canaries to validate operations and workflows. Consider creating or adopting tools that relay host and pipeline health so that potential service failures surface early. Alarms, metrics, and logs will also assist in determining any problems with dependencies and components.
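To make the canary idea concrete, here is a small sketch of one possible promotion rule: compare the canary's error rate with the baseline's and only proceed if the difference stays within a margin. The function name, its parameters, and the 1.0 percentage point default are all hypothetical.

```python
def canary_is_healthy(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    max_increase_pct: float = 1.0,
) -> bool:
    """Return True if the canary's error rate stays within the allowed margin."""
    if canary_total == 0:
        return False  # no traffic means no evidence, so do not promote
    baseline_rate = 100.0 * baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = 100.0 * canary_errors / canary_total
    # Margin is measured in percentage points above the baseline error rate.
    return canary_rate - baseline_rate <= max_increase_pct
```

For example, with 5 errors in 1000 baseline requests and 4 errors in 100 canary requests, the canary sits 3.5 percentage points above the baseline and the rule would hold the release back.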

Security measures and protocols

Implement access policies and roles in access management tools. Make policies as granular as possible, with limited permissions, and require justification documentation when assigning roles to an individual.

Implement security protection for databases, buckets, and blob storage containers. Put permissions and roles in place for executing programs and scripts and for accessing monitoring tools.
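Granularity is easier to keep when something checks for it. The sketch below scans a policy for statements that allow wildcard actions or resources. The policy layout (`statements`, `effect`, `actions`, `resources`) is a made-up, simplified format for illustration and does not match any specific provider's schema.

```python
def find_broad_statements(policy: dict) -> list:
    """Return the allow statements that contain a wildcard action or resource."""
    broad = []
    for statement in policy.get("statements", []):
        if statement.get("effect") != "allow":
            continue  # deny statements cannot grant excess access
        actions = list(statement.get("actions", []))
        resources = list(statement.get("resources", []))
        if any("*" in item for item in actions + resources):
            broad.append(statement)
    return broad


# Hypothetical policy in the simplified format described above.
example_policy = {
    "statements": [
        {"effect": "allow", "actions": ["storage:read"], "resources": ["bucket/reports"]},
        {"effect": "allow", "actions": ["storage:*"], "resources": ["bucket/reports"]},
        {"effect": "deny", "actions": ["*"], "resources": ["*"]},
    ]
}

flagged = find_broad_statements(example_policy)
```

Here only the second statement is flagged, since it allows every storage action. A check like this could run whenever a policy changes, with any flagged statement requiring the justification documentation mentioned above.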

Consider private and restricted domains, host control and security policies, Virtual Private Clouds (VPCs), Virtual Private Networks (VPNs), routing protocols, and data capture for logs. Additionally, implement alerts and alarms of various severities to indicate violations or unauthorized access.

Failures Should Always Be Assumed

Assuming failure is one of the best things you can do in your cloud design process. Automating your responses to those failures is ideal, and necessary to keep up.

By consistently designing and applying resolutions in this way, you move toward a cloud architecture where no failure is left unaccounted for.

Not to mention this will allow you to implement painless risk mitigation, robust optimization, and effective and efficient scaling as needed. Overall, it will provide a reliable service that customers and stakeholders can trust and commit to for the long term.

Author: Amanda Chavez

With a passion for technology, Amanda Chavez has worked extensively with the cloud. Some of her specialties include Linux systems, platform and system design, cloud security, automation, operational effectiveness, and cost efficiency. She continuously grows her knowledge and experience with the ever-changing cloud.
