If it hasn’t happened already, someone in your organization will soon ask you to explain your AWS bill. Depending on the complexity of your cloud environment, the bill will read as something between a riddle and an unbreakable code. We know; we’ve been there.

The sheer volume of data in your AWS bill is more than any one person can comprehend. There can be millions of billing line items: one for every resource, for every hour and every operation. This leads you down a dark path filled with tracking spreadsheets and database consolidations that make Excel and Access explode. You search far and wide to find out how other people unravel this mystery. Meanwhile, your costs continue to grow… and you don’t know exactly why.

You ask your engineering leads to start tagging resources, so you can better understand the operational cost of your software products. Everyone is excited: finally, a breakdown of operational cloud costs. And then someone introduces you to Reserved Instances (RIs)… and then something else… and the cycle repeats, over and over.

What Do You Need?

Ideally, you’re trying to understand the costs associated with unique environments, business units, teams or products, to help you better account for and manage cost of goods sold, operational expenditures or a whole catalog of related financial accounting metrics. Perhaps you initially treated your cloud bill as a single cost center, but now with increased reliance on AWS, a more granular understanding of costs is paramount to your business. You need a new way to understand and control cloud costs.

For maximum benefit, you have to buy your RIs in your master account to reduce the cost of EC2 instances you’ll need in the future. But this introduces a cumbersome accounting problem. You have no reliable way to assign RIs to different cost centers. Because AWS decides how RIs are assigned, and you don’t, this will lead to allocations that are misaligned with your budget preferences. Even if you could allocate RI benefits according to your budget, you have a bigger problem. You can’t see RIs that go unused, which can be a huge financial waste.

Optimizing RI utilization and properly allocating costs are just two examples of the overarching problem of determining the real cost performance and effectiveness of large-scale AWS environments. For too many of us, this is an absolute nightmare.

How Most People Address Cost

AWS generates an hourly billing report, and it’s packed full of data. The easiest analysis to do today is to average the cost data over daily, weekly or monthly intervals. If you plan to do this in Excel, monthly data pulls are likely your best bet, but be prepared for several hours of manual work for each analysis (which is typically why it’s done only once a month). Or you can purchase a solution that automates the averaging: a peanut butter spread of costs.
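To make the averaging approach concrete, here is a minimal sketch in Python. It assumes a heavily simplified line item of (timestamp, team tag, cost); real billing exports carry many more columns, and the field names and sample values here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly line items: (ISO timestamp, team tag, cost in USD).
# These values are made up for illustration only.
line_items = [
    ("2024-01-01T00:00:00", "web", 1.0),
    ("2024-01-01T01:00:00", "web", 3.0),
    ("2024-01-01T00:00:00", "data", 2.0),
    ("2024-01-02T00:00:00", "web", 4.0),
]


def daily_totals(items):
    """Sum hourly costs into one total per (day, team)."""
    totals = defaultdict(float)
    for timestamp, team, cost in items:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[(day, team)] += cost
    return dict(totals)


def average_daily_cost(items):
    """Average each team's daily totals over the days the team appears in."""
    per_team = defaultdict(list)
    for (_day, team), total in daily_totals(items).items():
        per_team[team].append(total)
    return {team: sum(values) / len(values) for team, values in per_team.items()}


averages = average_daily_cost(line_items)
```

Notice what the average throws away: the hour-by-hour shape of the spend is gone, which is exactly the detail engineers tend to ask about.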

Does this offer you a “true” representation of cost? Well no, but it’s better than nothing. Unfortunately, most engineers won’t accept this type of analysis because it isn’t precise, and precision matters here. You can imagine the back and forth between finance and engineering when the numbers don’t add up. If the analysis is suspect in any way, the engineering team will poke holes in it.

What about RI distribution? How do you account for consumption? Many choose to “peanut butter” spread their total RI investment evenly across each product or dev team and be done with it. The RI cost is treated as if the cloud environment were a single cost center. Is it accurate? Well, not really, but it’s better than nothing. While we’d generally agree with that, you still open yourself up to cost reports that don’t add up between departments. Everyone wants to know the cost, yet there is no agreed-upon, trusted solution. It’s a nightmare.
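A quick sketch shows why the even spread causes arguments. The first function below is the peanut butter approach. The second is one possible alternative, splitting the same RI cost by the RI-covered hours each team actually consumed. That alternative is our assumption for illustration, not a prescribed method, and the team names and numbers are hypothetical.

```python
def even_spread(total_ri_cost, teams):
    """Peanut butter spread: every team gets the same share of the RI cost."""
    share = total_ri_cost / len(teams)
    return {team: share for team in teams}


def usage_weighted_spread(total_ri_cost, ri_hours_by_team):
    """Illustrative alternative: split the RI cost by RI-covered hours used."""
    total_hours = sum(ri_hours_by_team.values())
    if total_hours == 0:
        # Nothing ran on the RIs, so there is no usage to weight by.
        return {team: 0.0 for team in ri_hours_by_team}
    return {
        team: total_ri_cost * hours / total_hours
        for team, hours in ri_hours_by_team.items()
    }


# Hypothetical month: 1,200 USD of RI cost and three teams.
ri_hours = {"web": 600, "data": 300, "ml": 100}
flat = even_spread(1200.0, list(ri_hours))
weighted = usage_weighted_spread(1200.0, ri_hours)
```

With these made-up numbers, the even spread charges every team the same amount even though one team used six times the RI hours of another. That gap is where the reports stop adding up between departments.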

And what happens? Well, finance, ops and engineering get together and the finger pointing starts, even though no one is really to blame. It’s a difficult problem to solve. Finance is frustrated because they don’t have a clear understanding of the real cost of goods sold. Operations has a difficult time ascertaining which teams need additional support, guidance and resources. Engineering grows frustrated because they deem the billing reports inaccurate, and they spend their time arguing about why the analysis doesn’t make sense.

There has to be a better way, right?

In A Perfect World

Okay, so what is it that we really need?

Imagine having a single, agreed upon (trusted) source for real-time, real-cost AWS billing data - broken out by business unit, team, product or perhaps even feature. Imagine having a real understanding of operational costs. Engineering could be predictive and initiate preventative measures based upon real-time cost anomalies, metrics, and trends. You could manage products and features against sales outcomes and adjust rapidly as needed to market forces.

Now, think what it would mean to do all of that continuously, with immediate and accurate analysis streaming to the right people. What would it mean if a technology could automate the distribution and accounting of reserved instances - amortized or blended in accordance with the organization’s financial principles and operational ambitions?

Sounds pretty nice, doesn’t it?

How To Become Perfect?

First and foremost - you need a way to properly blend the costs of reserved instance utilization - to get all development teams on the same level playing field. This is hard because there’s a lot of data you have to collect, normalize and streamline from the most granular billing data available.
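One way to read “blending” is a single hourly rate that every team pays, whether or not its instances happened to land on an RI. The sketch below is our interpretation under simplified assumptions: a single instance type, one amortized RI hourly rate, and one on-demand rate. The function names and all numbers are hypothetical.

```python
def blended_rate(usage_hours, ri_hours_purchased, ri_rate, on_demand_rate):
    """Return (blended hourly rate, unused RI hours) for one instance type."""
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    covered = min(usage_hours, ri_hours_purchased)
    on_demand_hours = usage_hours - covered
    unused_ri_hours = ri_hours_purchased - covered
    # RI hours are paid for whether or not anything ran on them.
    total_cost = ri_hours_purchased * ri_rate + on_demand_hours * on_demand_rate
    return total_cost / usage_hours, unused_ri_hours


def charge_teams(usage_by_team, ri_hours_purchased, ri_rate, on_demand_rate):
    """Charge every team the same blended rate for the hours it consumed."""
    total_hours = sum(usage_by_team.values())
    rate, _unused = blended_rate(
        total_hours, ri_hours_purchased, ri_rate, on_demand_rate
    )
    return {team: hours * rate for team, hours in usage_by_team.items()}


# Hypothetical month: 800 RI hours bought, 1,000 hours actually used.
usage = {"web": 600, "data": 400}
rate, unused = blended_rate(sum(usage.values()), 800, 0.0625, 0.125)
charges = charge_teams(usage, 800, 0.0625, 0.125)
```

Because the calculation also returns the unused RI hours, the same pass over the data surfaces the waste that is otherwise invisible.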

Second, you need all of this information as soon as possible. To have the most impact on the business, the data must be accurate, easy to understand, and real-time. Analysis that arrives days, weeks or a month later is much less useful, and it can be demoralizing once the finger pointing begins.

Third, you need the granularity and richness of hourly (or better) data. Granularity is needed to understand your cost performance - both the good and the bad. The data is there in the bill; you just need a new type of technology to make it useful.

Fourth, you need to be able to detect anomalous behavior and trends so that operational cost remains within your control. Granular and real-time billing data opens up new types of analysis that can connect costs directly to architectural design and system behavior. Cost doesn’t happen on its own; it results from system behavior.
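As a rough illustration of what anomaly detection on hourly cost data can look like, here is a minimal rolling z-score check. This is a generic statistical technique we are using as a stand-in, not a description of any particular product. The 24-hour window, the threshold of 3 and the sample costs are all assumptions.

```python
from statistics import mean, stdev


def find_cost_anomalies(hourly_costs, window=24, threshold=3.0):
    """Return indexes of hours whose cost deviates sharply from recent history.

    Each hour is compared against the mean and sample standard deviation
    of the preceding `window` hours.
    """
    anomalies = []
    for i in range(window, len(hourly_costs)):
        history = hourly_costs[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: flag any change at all.
            if hourly_costs[i] != mu:
                anomalies.append(i)
            continue
        if abs(hourly_costs[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical data: a steady pattern, one spike, then back to normal.
sample_costs = [10.0, 11.0] * 12 + [30.0] + [10.0, 11.0] * 2
spikes = find_cost_anomalies(sample_costs)
```

A check like this only becomes useful with hourly data; once costs are averaged over a month, the spike disappears into the total.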

Fifth and finally, we need to automate all of this and put the analysis into your hands, and those of a broad spectrum of users: engineers, operations, finance and business leaders, who can impact the business when they can easily see the state of their business operations.

Excel just isn’t enough.


This post was written in collaboration with Erik Peterson and Alex Zavorski.