Table Of Contents
  • The Cost Illusion: What The Pricing Doesn’t Show
  • Where The Costs Show Up: The Margins Problem
  • The Rule Of 40 Meets The Cost Of AI
  • Who’s Paying? Five Emerging Models
  • The Need For A Sharper, More Focused Lens
  • The Case For AI FinOps

The popular narrative around AI economics is changing. 

At one time, Moore’s Law conditioned us to expect that smarter, faster computing would steadily get cheaper. 

When it comes to AI, that expectation holds true at the unit level. Per-token costs are indeed declining. But the number of tokens consumed per task is growing exponentially, making total costs spike.

The tension here is important: on paper, inference is getting cheaper. According to Epoch AI, token pricing is falling fast.

Yet real‑world usage tells a more complicated story — one where falling per‑token cost is overshadowed by soaring total spend. Let’s take a deeper look at what these falling token prices are really hiding, and why that should matter to you.

The Cost Illusion: What The Pricing Doesn’t Show

As Andreessen Horowitz and Epoch AI both point out, the cost of LLM inference has dropped more than 10× per year in many cases. For tasks like summarization, classification, and simple Q&A, the per-token price keeps falling. Andreessen Horowitz calls this “LLMflation.”

Yet advanced workflows now routinely burn hundreds of thousands, or even millions, of tokens per request.

Check out the token usage examples outlined by the Wall Street Journal:

Task Type              Avg Tokens
Basic Q&A              50–500
Short summary          2,000–6,000
Basic code assist      1,000–2,000
Complex coding         50,000–100,000+
Legal doc analysis     250,000+
Multi-agent workflow   1 million+
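To see how quickly those token counts translate into dollars, here is a rough back-of-the-envelope calculation. The per-token price below is an assumed blended rate for illustration, not any vendor’s published pricing:

```python
# Illustrative arithmetic: per-token price vs. per-task cost.
# PRICE_PER_TOKEN is a hypothetical blended rate, not a quoted one.
PRICE_PER_TOKEN = 0.000002  # assumed $2 per million tokens

tasks = {
    "Basic Q&A": 500,
    "Short summary": 6_000,
    "Complex coding": 100_000,
    "Multi-agent workflow": 1_000_000,
}

for name, tokens in tasks.items():
    cost = tokens * PRICE_PER_TOKEN
    print(f"{name:22s} {tokens:>9,} tokens -> ${cost:,.4f} per request")
```

Even at a rock-bottom unit price, the per-request cost spans four orders of magnitude from a basic Q&A to a multi-agent workflow, which is why unit-price headlines tell you little about task-level spend.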

Why so many tokens? Because today’s models have evolved far beyond the simple chatbots of the early-2023 ChatGPT mania. They’re supercharged now.

These advanced LLMs don’t just generate a single output. They reason, loop, retry, and make decisions. They’re writing code, retrying failures (without additional prompting!), and chaining workflows autonomously.

What used to be a one-shot prompt is now a dynamic logic engine. Each step burns multiple tokens, and some models run through dozens or hundreds of steps per request.

A typical reasoning loop might look like this:

  1. Interpret the query
  2. Decide what tools or models to call
  3. Fetch data, run code, or call APIs
  4. Evaluate results, check for errors
  5. Retry, escalate, or replan
  6. Synthesize a final output
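In code, that loop might be sketched roughly like this. The `call_model`, `run_tool`, and `looks_done` functions are hypothetical stand-ins, not a real framework API; the point is that every pass through the loop burns more tokens:

```python
# Minimal sketch of the reasoning loop above. The callables passed in
# are hypothetical stand-ins for a model API and tool runner; each
# returns a (result, tokens_consumed) pair.
def agentic_request(query, call_model, run_tool, looks_done, max_steps=8):
    tokens_used = 0
    context = [query]
    for _ in range(max_steps):                      # 5. retry / replan loop
        plan, cost = call_model("plan", context)    # 1-2. interpret, pick tool
        tokens_used += cost
        result, cost = run_tool(plan)               # 3. fetch data / run code
        tokens_used += cost
        context.append(result)
        if looks_done(result):                      # 4. evaluate results
            break
    answer, cost = call_model("synthesize", context)  # 6. final output
    return answer, tokens_used + cost
```

Note that the token meter keeps climbing on every retry, which is exactly why multi-step agents can cost orders of magnitude more than a one-shot prompt.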

Agentic frameworks like AutoGPT or OpenAgents, tools like Cursor, and workflows inside Replit and Notion increasingly operate this way. These aren’t simple Q&A bots designed to tell you the weather or respond to simple queries. They’re autonomous systems executing long-running, multi-step logic.

Greater capabilities, but at what cost?

Here’s where the money comes in: These capabilities demand significantly more computational effort per task – and more spend. Token-heavy behavior drives up overall cost, even when models themselves are cheap.

What looks like unit-level efficiency can hide exploding costs at the task level. And that’s exactly what’s catching product and finance teams off guard. It’s even eating into margins. TechRepublic reported that Notion, for example, has seen its profit margins shrink by 10 percentage points as a result.

It’s like when energy gets cheaper, but you’re using more of it to run a modern home packed with energy-hungry devices: TVs, laptops, printers, battery chargers, plug-in hybrids, and so on. Your overall electricity bill goes up even though the per-unit cost is down. For businesses operating in the AI world, the same dynamic plays out at far greater scale.

The natural assumption, conditioned by Moore’s Law, is that smarter, rapidly evolving tech should get cheaper as it becomes more efficient. But AI is breaking that pattern precisely because it’s more capable: complex chains of logic and recursive reasoning inflate token use far beyond what simpler models required.

For businesses shipping AI-native products, that blind spot is already showing up in the numbers. In fact, CloudZero’s The State Of AI Costs In 2025 report finds that just 51% of organizations can confidently evaluate AI ROI.

That’s due to unchecked token growth, which leads to margin erosion, unpredictable unit costs, and pricing exposure for AI-powered products.


Where The Costs Show Up: The Margins Problem

For businesses building AI-native experiences like Notion, the situation is even riskier — and in some cases, financially perilous. 

As Business Insider recently reported, some platforms are discovering “inference whales”: users who consume tens of thousands of dollars’ worth of compute under flat-rate pricing models. One example cited showed a single developer consuming over $35,000 in compute on a $200 flat-rate plan. The mismatch created severe pricing exposure for the platform.

Cursor offers another example. According to the WSJ, its AI-native users are exhausting usage credits within days, which suggests current pricing tiers are misaligned with actual compute consumption.

WSJ also reports that Replit introduced “effort-based pricing” to curb usage. The move sparked public backlash on Reddit and a negative value perception, again threatening retention and recurring revenue.

The Rule Of 40 Meets The Cost Of AI

This isn’t just about edge cases, mitigation efforts, or inference whales. AI costs are exposing deeper flaws in how SaaS business models scale. AI helps you grow faster, but also eats into your margins. 

That creates a Rule of 40 paradox for SaaS businesses. The Rule of 40 holds that a healthy company’s revenue growth rate plus profit margin should total at least 40%, and AI cost inflation hits both sides of that equation, dragging total scores below sustainability thresholds.
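The arithmetic is simple enough to sketch. The figures below are hypothetical, not drawn from any company cited here:

```python
# Illustrative Rule of 40 arithmetic: score = growth rate (%) + margin (%).
# All figures are hypothetical examples.
def rule_of_40(growth_pct, margin_pct):
    return growth_pct + margin_pct

healthy = rule_of_40(30, 15)   # pre-AI: 30% growth, 15% margin -> 45, passes
strained = rule_of_40(35, -2)  # AI lifts growth but crushes margin -> 33, fails
```

The paradox in one line: the second company grows faster than the first, yet scores worse, because inference costs turned its margin negative.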

T3 Chat CEO Theo Browne puts it succinctly in the WSJ article: “The arms race for who can make the smartest thing has resulted in a race for who can make the most expensive thing.”

This is where traditional financial heuristics start to break down. If your AI features boost revenue but cannibalize profitability, you’re not just straining your infra budget. You’re undermining your growth narrative.

As margins tighten, companies are being forced to get creative and pragmatic about how they price and deliver AI.

Who’s Paying? Five Emerging Models

Companies are experimenting with different strategies, a pattern documented in multiple recent reports:

1. Enterprise eats it

Some large SaaS platforms and hyperscalers are absorbing inference costs to build strategic moats — gaining adoption while buying time to optimize cost structure. Notion and GitHub Copilot are examples.

2. Customer pays

Others are shifting costs directly to customers, either by raising prices or implementing metered usage. Flat-rate models have proven risky, especially when a small number of users drive disproportionate compute consumption.

3. Go dumber

Some platforms are implementing dynamic model routing, sending simple requests to lightweight models and reserving high-performance models for only the most demanding tasks. This helps contain costs while maintaining performance where it counts.
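A minimal sketch of what such routing can look like, assuming a hypothetical complexity score between 0 and 1 (the model names, rates, and threshold are all illustrative):

```python
# Hypothetical cost-aware model router. Model names, per-token rates,
# and the complexity threshold are illustrative assumptions.
MODELS = {
    "small": {"cost_per_1k_tokens": 0.0002},   # lightweight, cheap
    "large": {"cost_per_1k_tokens": 0.0100},   # high-performance, pricey
}

def route(task_complexity: float) -> str:
    """Send simple requests to the cheap model, demanding ones to the big one."""
    return "large" if task_complexity > 0.7 else "small"
```

Even this crude rule captures the economics: if most traffic scores as simple, the blended cost per request drops toward the cheap model’s rate while hard tasks keep full-strength performance.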

4. Hardware offloading

Others are investing in inference-specific hardware like accelerators or custom silicon to reduce cost-per-output at the infrastructure level.

5. Usage shaping & guardrails

Usage shaping tactics such as retry caps, depth limits, and API throttles are emerging as must-have architectural controls for AI cost management. These patterns mirror classic cloud governance practices from FinOps, adapted to AI workloads.
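A bare-bones sketch of such guardrails might look like the following. The budget numbers are illustrative assumptions, not recommended defaults:

```python
# Sketch of token, depth, and retry guardrails for an agent runtime.
# The default limits are illustrative assumptions, not recommendations.
class Budget:
    def __init__(self, max_tokens=200_000, max_depth=5, max_retries=2):
        self.max_tokens = max_tokens
        self.max_depth = max_depth
        self.max_retries = max_retries
        self.tokens = 0

    def charge(self, tokens, depth, retries):
        """Record spend for one agent step; raise if any cap is breached."""
        self.tokens += tokens
        if self.tokens > self.max_tokens:
            raise RuntimeError("token budget exhausted")
        if depth > self.max_depth:
            raise RuntimeError("agent recursion too deep")
        if retries > self.max_retries:
            raise RuntimeError("retry cap hit")
```

An agent loop would call `charge()` on every step, so a runaway retry or recursion storm fails fast instead of quietly accumulating spend.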

Company responses vary, but the root cause remains: a lack of visibility into what AI-propelled mechanisms actually cost.

The Need For A Sharper, More Focused Lens

This makes unit economics more than a back-office concern. It becomes a strategic lens for sustainable product development. Understanding cost per workflow or per customer isn’t just a FinOps metric. It’s a strategic imperative. It’s foundational to making AI a viable business model. 

Without this granularity, you’re scaling usage without clarity, and growing revenue without protecting your margin. In this new paradigm, flying blind on AI ROI means inviting financial debt in exchange for ephemeral growth.

CFOs and product leaders must move beyond blended cost metrics to uncover the true unit economics of AI-powered features. What matters now is fine-grained visibility. Tracking cost by logic loop, by token pathway, by agent behavior. AI observability can’t stop at GPU utilization; it must map directly to economic value creation.

If one thing is clear, it’s that the key is building intelligent cost architecture: optimizing where it matters, containing runaway tasks, and treating token spend as a constrained resource to be allocated like compute or storage. 

This isn’t about cutting features. It’s about building AI systems where cost is a design input, not an afterthought:

  • Prioritization: Reserve top-tier models for tasks that truly demand them, while routing simpler tasks to leaner models.
  • Guardrails: Enforce caps, retries, and budget-aware agent design to contain careening costs.
  • Observability: Track token flows at the level of workflows, users, and business functions.
  • Governance: Align usage with business priorities, ensuring spend translates into measurable value.

In other words, intelligent cost architecture treats inference as a scarce resource to be allocated strategically, much like compute or storage in the early days of cloud FinOps.

The Case For AI FinOps

To bring clarity and control to these architectural tradeoffs, teams are increasingly turning to a new discipline: AI FinOps.

Just like Moore’s Law lulled tech leaders into expecting ever-cheaper compute, AI’s efficiency curve has done the same with inference. But the illusion is breaking. We’re no longer in a linear cost decline. We’re in a compound complexity explosion.

Our free AI Cost Optimization Playbook provides a framework to get started. Meanwhile, note these new requirements for AI-aware cost control:

  • Token-level observability: View token use by user, model, task, and time.
  • Per-workflow cost breakdowns: Know what each function actually costs.
  • Effort-based forecasting: Estimate cost-to-serve based on request complexity.
  • Budget-aware agent design: Prevent infinite loops or redundant calls.
  • Model routing dashboards: Auto-select models based on cost/accuracy tradeoffs.
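As a minimal illustration of the first two requirements, a token-level cost ledger can be as simple as aggregating usage events by workflow and by user. The event fields and per-token rate here are assumptions for the sketch:

```python
# Toy token-level cost ledger: aggregates spend by workflow and user.
# The event schema and RATE are illustrative assumptions.
from collections import defaultdict

RATE = 0.000002  # assumed $/token

def cost_breakdown(events):
    """events: iterable of dicts with 'workflow', 'user', 'tokens' keys."""
    by_workflow = defaultdict(float)
    by_user = defaultdict(float)
    for e in events:
        cost = e["tokens"] * RATE
        by_workflow[e["workflow"]] += cost
        by_user[e["user"]] += cost
    return dict(by_workflow), dict(by_user)
```

In practice you would stream these events from your inference gateway or model provider’s usage logs, but the principle is the same: every token gets attributed to a workflow and a customer, so cost-to-serve stops being a blended mystery.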

This approach is often referred to as AI FinOps: the practice of aligning AI infrastructure spend with business value.

It’s not just about “spend less”. It’s about seeing what’s being spent, where, and why. That visibility creates control. And control protects margin.

Don’t let falling token prices lure you into expensive habits. Start treating token spend as a scarce, governable asset and design AI systems with cost at their core. That’s the true path to AI‑powered growth that actually pays off. 

In the new AI economy, margin discipline is the moat. Start building it now.

The Cloud Cost Playbook

The step-by-step guide to cost maturity
