This article is part of a two-part series on cost allocation. Find the second article, “Are We There Yet? How To Know When You’ve Got Deep Enough Cloud Cost Metrics” here.
High-quality cloud cost allocation has become an existential issue for businesses. In order to get as much out of their (mounting) cloud investments as possible, business leaders need to know how much they’re spending in the cloud, what/who they’re spending it on, and whether there’s a good reason for it.
In its ideal form, cost allocation answers all these questions. It matches what you spend with who you spend it on, giving you business-relevant metrics like cost per customer, cost per product, cost per feature, etc.
But how do you get to that ideal form? How do you go from inefficient, semi-accurate cost allocation (usually done through manual tagging) to efficient, precise cost allocation?
Two main methods have emerged: top-down allocation and bottom-up allocation. Different tools use one or the other (ours uses top-down). As their names indicate, they’re opposites — bottom-up allocation starts at the most granular resource level and then generalizes, whereas top-down allocation starts at the highest level of spend and then gets specific.
Think of it in terms of searching for sunken treasure at sea. The bottom-up method would be like starting by combing the whole ocean floor, whereas the top-down method would be like using sonar to identify promising places to drop anchor.
If you had infinite time and money, combing the whole ocean floor would yield the best results, ten times out of ten.
But most of us don’t have infinite resources. Time is scarce, money is finite, and we invest based on probabilities. Because of this, CloudZero uses top-down allocation, which starts broad, focusing on the highest-impact areas first, and gets more granular based on results and experience.
Let’s dive deeper (pun very much intended) into why we prefer top-down allocation.
3 Reasons Why Top-Down Allocation Is The More Effective Approach
1. Top-Down Allocation Has No Resource Limits
Bottom-up cost allocation tools (like AWS Application Cost Profiler, or ACP) start by targeting individual resources, like EC2, Lambda, and others. Having entered at that level, they then work upward, exploring each customer/product/feature that uses those resources.
The problem here is that you’re limited by the number of resources a tool is built to target. Most of these tools can only handle a handful of resources. ACP lists only four: EC2 instances, SQS queues, SNS topics, and DynamoDB reads and writes.
If you use any other resource that the tool isn’t calibrated to target — like RDS, S3, CloudFront, API Gateway, or EBS — it’s impossible to include it in your overall cost allocation.
Rather than starting at the resource level, top-down solutions like CloudZero start with all production spend, allocated to your business context.
Step one is grouping spend from all your providers — AWS, GCP, Azure, Snowflake, New Relic, etc. — and putting it into business context.
CloudZero uses a custom domain-specific language, called CostFormation, that lets you apply flexible rules to your billing data in order to aggregate resources into business-centric Dimensions.
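CostFormation’s syntax is proprietary, but the underlying idea — rules that map raw billing line items into business-centric Dimensions — can be illustrated generically. The patterns and Dimension labels below are hypothetical, not CostFormation syntax:

```python
import fnmatch

# Hypothetical dimension rules mapping resource names to a business Dimension.
# Ordered: first matching pattern wins; the catch-all ensures every line
# item lands somewhere (so nothing silently drops out of the allocation).
RULES = [
    ("prod-api-*", "Product: API"),
    ("prod-web-*", "Product: Web App"),
    ("*", "Unallocated"),
]

def dimension_for(resource_name):
    """Return the business Dimension for a billed resource."""
    for pattern, dimension in RULES:
        if fnmatch.fnmatch(resource_name, pattern):
            return dimension
```

For example, `dimension_for("prod-api-server-7")` would land in "Product: API", while an untagged bucket would fall into "Unallocated" — a useful signal of how much spend your rules don’t yet cover.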
Step two is allocating costs (especially those hard-to-allocate shared costs!) based on real customer and product usage. Choose a metric that represents how your customers are using your platform (e.g., number of messages processed, duration of a query, or length of video streams processed) and correlate it with your cost. CloudZero does this with telemetry streams.
This way, with a single metric, you are dynamically allocating all the spend from all cloud services, and thus all the resources within.
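The proportional-allocation idea behind step two is simple to sketch. Assuming a hypothetical usage metric (messages processed per customer) and a pool of shared spend, each customer’s share of the metric becomes their share of the cost:

```python
def allocate_by_usage(total_cost, usage_by_customer):
    """Split a shared cost across customers in proportion to a usage metric."""
    total_usage = sum(usage_by_customer.values())
    return {
        customer: total_cost * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }

# Hypothetical example: $900 of shared spend, allocated by messages processed.
messages = {"acme": 600_000, "globex": 300_000, "initech": 100_000}
costs = allocate_by_usage(900.0, messages)
# acme bears 60% of the shared cost, globex 30%, initech 10%
```

Because the split is driven by one telemetry stream rather than by tagging individual resources, every dollar in the pool — including hard-to-tag shared infrastructure — gets allocated.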
Step three is layering in additional metrics to get even more specialized views of your spend.
For example, after starting with cost per customer, maybe you go to cost per product per customer, then cost per feature per product per customer. All this (and more) can be achieved by adding new telemetry streams and crafting custom Dimensions.
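Layering in step three is just rolling the same allocated cost records up along more dimensions. A minimal sketch (the record fields and values are hypothetical):

```python
from collections import defaultdict

def unit_costs(records, *dims):
    """Roll up allocated cost records along any combination of dimensions."""
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        totals[key] += rec["cost"]
    return dict(totals)

# Hypothetical records with customer/product/feature already attached
# by earlier allocation steps.
records = [
    {"customer": "acme", "product": "api", "feature": "search", "cost": 40.0},
    {"customer": "acme", "product": "api", "feature": "export", "cost": 10.0},
    {"customer": "globex", "product": "api", "feature": "search", "cost": 25.0},
]

per_customer = unit_costs(records, "customer")
per_feature = unit_costs(records, "customer", "product", "feature")
```

The same records answer “cost per customer” and “cost per feature per product per customer” — each new telemetry stream just adds another field you can group by.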
With bottom-up allocation, it takes a lot of initial work to get usable unit costs because you’re only getting data from a subset of your spend at a time. With top-down allocation, you get usable information with just one metric, and then dive as deep as you want.
2. Top-Down Allocation Gives You Only The Granularity You Need
It might seem like you’d want as granular and universal a view into your cloud spend as possible. There’s some truth to that — the more time you spend analyzing your cloud spend, the more you’ll get out of an atomic (if not subatomic) view.
But we maintain it’s better to get granular based on real business impact than to start granular and hope for business impact.
Think back to the sunken treasure metaphor: What’s the main piece of information you want? The location of treasure chests.
Do you care about the location of coral reefs, deep sea vents, crab colonies, jellyfish clusters, worm civilizations, and everything else you’ll find by scraping the ocean floor?
You don’t. But a bottom-up approach would give you all of this data, meaning you’d have to sift through a lot of information that, while interesting, would have little to no impact on your business goal.
But if you started general, using sonar to identify the location of sunken ships, for example, you’d be able to conduct a more meaningful, efficient search.
At the start of your cloud cost allocation journey, you don’t know precisely what cost information is going to help you make better business decisions. It’s better to start slow and learn it as you go than to throw yourself into a data ocean and hope for the best.
3. Top-Down Allocation Is Designed to Facilitate Your FinOps Journey
If you’ve spent any time researching FinOps, you know it’s based on a maturity framework:
Crawl → Walk → Run
Crawl is about putting the people, teams, processes, and tools in place, and then starting to collect information.
Walk is about getting deeper information, having more meaningful conversations, and strategizing about maximizing cloud ROI.
Run is where you’ve got total allocation, dynamic unit cost metrics, high levels of FinOps enthusiasm, and a flywheel to make sure systems are always improving.
Top-down allocation mirrors the Crawl → Walk → Run framework. You start with high-level cost data, expend relatively little effort getting your first metric (many of our partners start with Cost Per Business Unit or Cost Per Product Feature), and get into deeper, more targeted metrics over time.
(A really interesting example of nuanced metrics comes from Beamable, a company that provides infrastructure to video game developers; we ultimately helped them see cost per service per game per customer.)
The bottom-up approach expects you to start at the “Run” stage. It requires you to instrument a wide array of metrics before you get any usable unit cost information. That’s just not reasonable.
Starting with more granularity than you need can be a burden, and having to produce many granular metrics to get any unit data is like skipping “Crawl” and “Walk” and going straight to “Run.” Moreover, since most organizations are at the “Pre-crawl,” “Crawl,” or “Walk” stages of FinOps, using a bottom-up approach is like entering a toddler into an Olympic sprint.
The overarching benefit of top-down allocation is that it lets you craft a powerful cost allocation schema over time — based on real business impact. No two organizations’ business models are identical, and thus, the information they’ll need to make sound business decisions won’t be identical either.
We use our own platform ourselves, with five telemetry streams feeding our unit cost data. One of our bigger customers uses 200 telemetry streams, another uses 28, and others fall anywhere in between. It all depends on what their needs are, and what it takes to get real business impact.
Starting with total granularity gives you an enormous burden, right upfront. Starting general and figuring out where to drill gives you a tailored system, producing data at the pace that you can handle it. Ready to start looking for sunken treasure?
In Part II of this series, I answer the most common follow-up question I get: “How does my organization know when we’ve got deep enough granularity?” Or, in other words: “Are we there yet?” Read it here.