From performance to security to cost, these key cloud metrics will help you monitor and maintain your cloud's health.
Author Jeff Duntemann said a good tool improves how you work, whereas a great tool transforms your thinking. Companies that want to improve their cloud-based operations can rely on cloud metrics as an effective tool for transforming their cloud operations.
You can't fix what you don't measure.
Cloud metrics are the logs of data that a cloud infrastructure or application generates. Using the data, organizations can detect, monitor, and respond to various changes in costs, security, and performance of their cloud environments.
By collecting, analyzing, and acting on the right cloud metrics, you can:
So, what are the crucial cloud metrics to monitor continuously for organizational success?
We’ve covered the vital DevOps metrics to track and key SaaS metrics for reporting before. In this guide, we'll look at some of the most important cloud metrics you should be monitoring.
Cloud-based infrastructure, applications, and other components generate metrics companies can use to measure the reliability and operational excellence of their cloud services.
Uptime measures the percentage of time a service or system is available to serve customer requests. Downtime is the opposite. High uptime increases your chances of retaining customers and generating revenue.
Credit: Amazon CloudWatch
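As a rough illustration of the availability percentage described above, here is a minimal Python sketch. The function name and its inputs are assumptions for illustration; in practice, the downtime figure would come from your monitoring tool.

```python
def uptime_percent(period_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a percentage of the measurement period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime_seconds must be between 0 and period_seconds")
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds


# A 30-day month (2,592,000 seconds) with 2,592 seconds (43.2 minutes) of downtime.
month_seconds = 30 * 24 * 60 * 60
print(round(uptime_percent(month_seconds, 2592), 3))  # 99.9
```

Put differently, a "three nines" target leaves roughly 43 minutes of downtime in a 30-day month.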
CPU utilization measures the percentage of compute units you use. Tracking it will reveal whether a CPU is over-utilized, which throttles performance, or under-utilized, which means you are paying for capacity you don't need.
Memory utilization helps to measure memory usage in public, private, and hybrid cloud environments. A consistently high memory utilization may require you to scale up your memory capacity to ensure smooth performance.
Requests per minute tell how many requests a cloud-based application receives each minute. It is crucial to monitor how and when users access the app, so you can scale your cloud resources to meet demand, ensuring optimal performance.
Credit: Letswp
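If your application logs a timestamp for every request, requests per minute can be derived by bucketing those timestamps. The sketch below assumes a plain list of Python datetime objects; the function name and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def requests_per_minute(timestamps):
    """Count requests in each calendar minute.

    `timestamps` is an iterable of datetime objects, one per request.
    Returns a dict mapping the start of each minute to its request count.
    """
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical sample: two requests at 12:00 and one at 12:01.
sample = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 1, 10),
]
for minute, count in requests_per_minute(sample).items():
    print(minute.isoformat(), count)
```

Minutes with no traffic do not appear in the result, so fill in zeros before charting if you need a continuous series.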
Disk utilization enables you to track how much of a node's storage capacity is in use, so you can tell if it is sufficient for your workloads. Typical storage metrics include IOPS and throughput. The IOPS metric describes the number of read and write operations per second, whereas throughput measures the amount of data transferred to and from storage in bytes per second (B/s).
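Monitoring agents often expose disk activity as cumulative counters, so IOPS and throughput are typically calculated from the difference between two samples. The sketch below assumes that approach; the DiskSample shape and its field names are hypothetical, not a specific provider's format.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Cumulative disk counters at one point in time (hypothetical shape)."""
    timestamp: float      # seconds
    read_ops: int
    write_ops: int
    read_bytes: int
    written_bytes: int


def disk_rates(earlier: DiskSample, later: DiskSample):
    """Return (IOPS, throughput in bytes per second) between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    ops = (later.read_ops - earlier.read_ops) + (later.write_ops - earlier.write_ops)
    data = (later.read_bytes - earlier.read_bytes) + (
        later.written_bytes - earlier.written_bytes
    )
    return ops / elapsed, data / elapsed


# Two hypothetical samples taken 10 seconds apart.
first = DiskSample(0.0, 1000, 500, 4_000_000, 2_000_000)
second = DiskSample(10.0, 4000, 2500, 16_000_000, 10_000_000)
iops, throughput = disk_rates(first, second)
print(iops, throughput)  # 500.0 2000000.0
```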
Average time to acknowledge refers to the average time your application takes to begin a response to a request. If acknowledgement times are slow, then there may be a load balancer issue, or the app is struggling with underprovisioning and other latency issues.
Latency measures the time between when a customer sends a request (request time) and when the cloud provider sends back a response (response time). High latency can negatively impact productivity. Your cloud provider's backend servers, web server dependencies, and network problems could all lead to increased latency.
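One simple way to observe latency from the client side is to time a call from start to finish. This is a minimal sketch: fake_request is a hypothetical stand-in for a real call to a cloud service, and the number it reports includes network time as well as server processing time.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_request():
    # Hypothetical stand-in for a real request to a cloud service.
    time.sleep(0.05)
    return "ok"


response, latency_ms = timed_call(fake_request)
print(f"{response} in {latency_ms:.1f} ms")
```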
The error rate measures how often a request results in an error. You can troubleshoot issues like improperly configured access credentials by identifying the types of errors the system generates.
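For an HTTP-based service, the error rate and a breakdown by error type can be sketched from a list of response status codes. Treating every status of 400 or above as an error is an assumption here; some teams count only 5xx server errors.

```python
from collections import Counter


def error_summary(status_codes):
    """Return (error rate as a percentage, Counter of error status codes)."""
    codes = list(status_codes)
    errors = [code for code in codes if code >= 400]
    rate = 100.0 * len(errors) / len(codes) if codes else 0.0
    return rate, Counter(errors)


# Hypothetical sample of ten responses.
codes = [200, 200, 403, 200, 500, 403, 200, 200, 200, 200]
rate, by_type = error_summary(codes)
print(rate)                    # 30.0
print(by_type.most_common())   # [(403, 2), (500, 1)]
```

In this sample, the repeated 403 responses would point toward an access or credentials problem rather than a server fault.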
Swap usage refers to the amount of disk space devoted to holding data that should be in memory. High swap usage degrades application performance and defeats the purpose of in-memory caching.
Mean Time Between Failure (MTBF) refers to the average time a repairable cloud component works before failing. It will help you understand why systems fail, so you can identify repair methods that improve MTBF and be better equipped to tolerate failures.
Mean Time to Repair (MTTR) measures the average time it takes to repair a failed cloud component and get it back in service. Analyzing MTTR will help you understand how long it takes your company to restore service after a failure. A shorter MTTR increases your chances of retaining clients and meeting SLAs.
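MTBF and MTTR can both be derived from the same incident log. The sketch below uses the common definitions (operating time divided by number of failures, and repair time divided by number of repairs); the incident format is a hypothetical one chosen for illustration.

```python
def mtbf_and_mttr(incidents, period_hours):
    """Return (MTBF, MTTR) in hours.

    `incidents` is a list of (failed_at_hour, restored_at_hour) pairs
    that fall within an observation period of `period_hours`.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum(restored - failed for failed, restored in incidents)
    uptime = period_hours - downtime
    failures = len(incidents)
    return uptime / failures, downtime / failures


# Hypothetical 30-day period (720 hours) with three outages.
outages = [(100, 102), (400, 401), (650, 653)]
mtbf, mttr = mtbf_and_mttr(outages, 720)
print(mtbf, mttr)  # 238.0 2.0
```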
There are a variety of metrics to gauge the robustness of cloud infrastructure and the applications it hosts. They include:
Capacity test metrics indicate the maximum load amount or traffic your cloud system can handle without throttling performance in production.
Targeted infrastructure metrics enable you to isolate and fix problem areas of a specific layer or application component.
Stress testing metrics help gauge the stability and responsiveness of your cloud environment and its components under high loads.
Load testing metrics enable engineering to check how cloud resources perform when multiple users try to access and use them simultaneously.
Failover test metrics measure a system’s ability to call up additional cloud resources to handle heavy or peak loads.
Latency test metrics help you determine the time it takes your cloud resources to transfer data messages between two points on the network.
Soak test metrics are indicators of your cloud system’s resilience during prolonged periods of heavy traffic.
Overall, cloud performance test metrics will help verify if your system will perform efficiently in a production environment.
Keeping track of security and compliance KPIs is particularly challenging in the cloud's dynamic computing environment. Yet, it is possible. To mitigate threats, monitor metrics such as:
Patched/unpatched known vulnerabilities will indicate how promptly and adequately you patch cloud security risks in your system, or whether you leave them open for too long.
Requests per minute measures not only cloud performance but also risk. A high number of requests per minute may indicate an ongoing threat, such as a Distributed Denial of Service (DDoS) attack.
Peer-to-peer file-sharing metrics help monitor changes in the number of files downloaded or shared through authorized means. An increase may indicate a compromised cloud security posture.
Data on violations, compliance score, and resolution progress are examples of compliance metrics.
By monitoring security and compliance metrics, you can prevent your cloud system from leaking confidential business information and customer data, and avoid damage to your reputation.
Credit: Opsview
It was fairly straightforward to fix network issues when apps were hosted over a Local Area Network (LAN). It takes greater attention to identify the cause of network issues in the cloud.
Here are some crucial cloud networking metrics to keep an eye on:
Network capacity is the maximum data transit rate possible between a source and destination through the most congested hop in the application delivery path.
Available capacity measures the actual amount of network resources available to applications, while utilized capacity is a strong indicator of network performance degradation. Together, the two metrics will help you determine the root cause of service degradation.
Packet loss measures the percentage of network packets lost between the source and destination. Packet loss can cause latency and network congestion when a protocol such as TCP retransmits the lost data. Track this metric to make sure your system doesn't drop users' requests and frustrate customers.
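The packet loss percentage itself is simple arithmetic once you know how many packets were sent and how many arrived. A minimal sketch, with a hypothetical function name and sample counts:

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Return the percentage of packets lost between source and destination."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# Hypothetical probe: 1,000 packets sent, 985 received.
print(packet_loss_percent(1000, 985))  # 1.5
```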
Network metrics provide a good indication of the kind of customer experience your organization provides. While modern networking technologies can easily handle small packet losses and jitter, sustained network problems can cause customers to unsubscribe from your service.
Cost-conscious teams treat cloud costs as a first-class metric. Engineers can then build cost-effective solutions while finance optimizes cloud costs without hindering innovation.
But it is not enough to collect high-level cost metrics that are difficult to link to actual business activity. Instead, monitor unit costs that relate to specific business activities, such as:
Cost per feature is a measure of the amount you spend to release and support a particular product feature. You can track which customers use it, when, and how often. You can also use this metric to calculate how much you need to charge for the feature to turn a profit.
The average cost per customer does not tell you how much you spend on specific customers. But with cost per customer, you can identify your most expensive customers and adjust pricing across different customer segments to improve margins.
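To see why the average can hide your most expensive customers, consider a small sketch. It assumes each cost line item has already been attributed to a customer, which is the hard part in practice; the customer names and amounts are made up.

```python
from collections import defaultdict


def cost_per_customer(line_items):
    """Sum cost by customer from (customer_id, cost) pairs."""
    totals = defaultdict(float)
    for customer, cost in line_items:
        totals[customer] += cost
    return dict(totals)


# Hypothetical line items that are already attributed to customers.
items = [("acme", 120.0), ("globex", 30.0), ("acme", 45.5), ("initech", 10.0)]
totals = cost_per_customer(items)
average = sum(totals.values()) / len(totals)
most_expensive = max(totals, key=totals.get)

print(totals)          # {'acme': 165.5, 'globex': 30.0, 'initech': 10.0}
print(average)         # 68.5
print(most_expensive)  # acme
```

Here the average of 68.5 describes none of the three customers well: one costs more than twice that, and the other two cost far less.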
Cost per team indicates whether your team is as efficient as others in the industry or as you’d like. It is difficult for distributed teams to track their cloud costs without a robust tool.
Cost per deployment allows your engineering and finance teams to visualize how much a given deployment project costs from beginning to end. As you collect insight, you will be better able to allocate cloud resources and lower waste.
Cost metrics in containerized applications and Kubernetes clusters are difficult to monitor using most cloud cost management tools.
Multi-cloud cost metrics are generated across different clouds, making it tough to collect, enrich, and present them to various stakeholders in a language they understand.
Using CloudZero, you can easily collect, analyze, and report on multiple unit cost metrics that matter to your business. Like an observability tool, CloudZero collects cloud metrics from multiple sources, so you don't have to have perfect AWS tags to get accurate cost insights.
It also recognizes abnormal cost changes and alerts the right person or team instantly via Slack to prevent overspending.
This blog post was written and reviewed by the CloudZero team. Combined, our team has more than a quarter century of experience in the cloud cost space. Every blog post is extensively researched and reviewed by several members of our team for accuracy and readability.