On March 18, Jensen Huang took the stage at NVIDIA’s GTC conference in San Jose for a keynote that ran well over two hours — covering everything from CUDA’s 20-year history to humanoid robots that may one day wander Disneyland.
But buried inside the spectacle was a remarkably clear-eyed articulation of the economic forces now bearing down on every enterprise that builds on cloud infrastructure.
For engineering executives, FinOps practitioners, and CFOs trying to get ahead of AI inference costs, this keynote was a roadmap: not just for where NVIDIA is going, but for the cost management and accountability problems that are about to get significantly harder to ignore.
Here are the most important takeaways.
1. The inference inflection is here — and your budget model isn’t ready for it
Huang was unambiguous: the era of AI training spend as the dominant cost driver is over. Inference is the workload now, and it’s scaling fast.
“AI now has to think. In order to think, it has to inference. AI now has to do. In order to do, it has to inference.”
He argued that the compute demanded per task has increased roughly 10,000x in two years, and usage has grown roughly 100x on top of that. Multiply the two and you get his headline number: “I believe that computing demand has increased by 1 million times in the last 2 years,” Huang told the crowd. That’s not a rounding error. That’s a new cost regime.
For finance leaders, this means the budget models built around training runs are the wrong lens. Training spend is discrete, bounded, and somewhat foreseeable. Inference is none of those things.
It’s continuous, variable, and directly tied to product usage. Every time a user invokes an AI feature, tokens get generated and cost accumulates. The more successful the product, the higher the bill. That’s a fundamentally different cost profile than anything traditional FinOps frameworks were designed to handle.
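To make that concrete, here’s a minimal sketch of how an inference bill scales with adoption. Every parameter is an illustrative assumption, not a benchmark:

```python
# A minimal sketch of inference cost scaling with product success.
# All parameters are illustrative assumptions, not benchmarks.

def monthly_inference_cost(
    monthly_active_users: int,
    requests_per_user: int,          # AI feature invocations per user per month
    tokens_per_request: int,         # input + output tokens per invocation
    price_per_million_tokens: float,
) -> float:
    """Spend grows linearly with usage, unlike a bounded training run."""
    total_tokens = monthly_active_users * requests_per_user * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million_tokens

# Same feature, 100x adoption: the bill follows.
for users in (10_000, 100_000, 1_000_000):
    cost = monthly_inference_cost(users, requests_per_user=50,
                                  tokens_per_request=2_000,
                                  price_per_million_tokens=6.00)
    print(f"{users:>9,} users -> ${cost:>12,.2f}/month")
```

There is no steady state in that model: the cost line tracks the adoption line, month after month.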
CloudZero’s Cloud Economics Pulse bears this out: while enterprise survey respondents report allocating 30–36% of cloud budgets toward AI workloads, actual measured AI spend on cloud bills sits closer to 2.5%.
The gap isn’t because organizations aren’t spending. Inference costs are embedding themselves inside compute and storage line items that were never designed to surface them. By the time finance teams see the number, it’s already happened.
Part of the reason is structural: there’s no shared framework for what AI output actually costs. Huang spent significant time laying one out.
2. Tokens are the new commodity — every business needs to price them
One of the most striking moments in the keynote was when Huang laid out a token pricing framework directly: free tier, $3 per million tokens, $6, $45, $150, with each tier corresponding to a different model size, context length, and response speed. He wasn’t describing NVIDIA’s pricing. He was describing the economics every enterprise AI service will eventually adopt.
“Tokens are the new commodity. And like all commodities, once it reaches an inflection, once it becomes mature or becomes maturing, it will segment into different parts.”
This has immediate implications for unit economics. If tokens are the unit of AI output, then cost per token becomes the metric that matters. More importantly: revenue or business value per token.
Organizations that can’t attribute token consumption to products, teams, or business outcomes will have no way to answer the question every CFO is about to ask: are we generating more value than we’re spending to generate it?
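As a sketch of what answering that looks like, the snippet below computes margin per million tokens across the tiers Huang quoted. The tier prices are his; the revenue-per-token figure is a made-up placeholder:

```python
# Tier prices follow the ones Huang cited on stage ($3/$6/$45/$150 per
# million tokens); the revenue-per-million figure is a hypothetical placeholder.

TIERS = {"small": 3.00, "medium": 6.00, "large": 45.00, "frontier": 150.00}

def margin_per_million_tokens(revenue_per_million: float, tier: str) -> float:
    """Business value attributable to a million tokens, minus their cost."""
    return revenue_per_million - TIERS[tier]

# A feature that monetizes at $20 per million tokens is profitable on the
# $3 and $6 tiers and deeply underwater on the $45 and $150 tiers.
for tier, price in TIERS.items():
    margin = margin_per_million_tokens(20.00, tier)
    print(f"{tier:>8} (${price:>6.2f}/1M): margin ${margin:+8.2f} per 1M tokens")
```

The arithmetic is trivial. The attribution that feeds it, knowing which tokens belong to which product, is not.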
This maps directly to what CloudZero’s Director of Finance, Emily Allen, discovered using the CloudZero MCP integration with Claude Code. AI API costs were climbing fast. The only visibility came from manually copying a credit card statement into a spreadsheet.
Ten minutes of plain-English prompting later, she had identified that Opus model adoption had grown more than 2,000% and found $13,000–$30,000 in monthly savings by modeling a shift from Opus to Sonnet for workloads where the premium wasn’t justified.
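A rough sketch of that modeling exercise looks like this. The list prices reflect Anthropic’s published per-million-token rates for Opus and Sonnet (verify current pricing before relying on them); the token volumes and the 60% shiftable share are hypothetical:

```python
# Hedged sketch of an Opus-to-Sonnet shift model. Prices are published
# per-million-token rates (verify before relying on them); volumes and
# the shiftable share are assumptions for illustration.

PRICES = {  # (input, output) USD per million tokens
    "opus":   (15.00, 75.00),
    "sonnet": (3.00, 15.00),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Spend for a month of usage, volumes in millions of tokens."""
    p_in, p_out = PRICES[model]
    return input_m * p_in + output_m * p_out

input_m, output_m = 1_000.0, 300.0  # hypothetical monthly volumes
shift = 0.60                        # share of workloads where Opus isn't justified

baseline = monthly_cost("opus", input_m, output_m)
blended = ((1 - shift) * baseline
           + shift * monthly_cost("sonnet", input_m, output_m))
print(f"baseline ${baseline:,.0f}/mo -> blended ${blended:,.0f}/mo "
      f"(saves ${baseline - blended:,.0f}/mo)")
```

The exact savings depend entirely on your volumes and mix; what matters is that the model makes the tradeoff explicit instead of buried in a credit card statement.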
Tokens are already a commodity. Cost per token is already a business decision.
Understanding that cost starts with understanding the infrastructure producing the tokens — and that infrastructure has a hard ceiling.
3. Your AI factory is power-constrained — and that has real cost consequences
Huang kept returning to a single physical constraint: a data center has a fixed power envelope, and that’s it. Given that, every architectural decision becomes a question of token yield per watt.
“You still have to build a gigawatt factory. And that gigawatt factory for 15 years amortized across that gigawatt factory is about $40 billion. Even when you put nothing on, it’s $40 billion in.”
Most enterprise teams aren’t building gigawatt data centers. But the principle scales.
Whether you’re running inference on reserved GPU capacity, on-demand cloud instances, or a mix of both, you have a fixed spend envelope. The question is how many useful outputs you’re extracting from it.
Teams that treat GPU spend as a black box, paying for capacity without measuring what it produces, are flying blind. As Huang made clear, the cost of the wrong architecture isn’t just a performance problem. At scale, it’s tens or hundreds of millions of dollars in infrastructure that can’t recover its investment.
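A back-of-the-envelope version of Huang’s math shows why token yield per watt dominates. The $40 billion, 15-year amortization, and gigawatt envelope come from his remarks; the yield and utilization figures are assumptions:

```python
# Back-of-the-envelope "token yield per watt" math. The $40B / 15-year /
# 1-gigawatt figures come from Huang's remarks; yield and utilization
# are assumptions for illustration.

SECONDS_PER_YEAR = 365 * 24 * 3600

def amortized_cost_per_million_tokens(
    capex_usd: float,
    years: int,
    power_watts: float,
    tokens_per_sec_per_watt: float,  # what architecture choices control
    utilization: float = 0.7,        # assumed share of the envelope doing useful work
) -> float:
    lifetime_tokens = (power_watts * tokens_per_sec_per_watt
                       * utilization * years * SECONDS_PER_YEAR)
    return capex_usd / lifetime_tokens * 1_000_000

# Doubling yield per watt halves the amortized cost of every token produced.
for yield_per_watt in (0.5, 1.0, 2.0):
    c = amortized_cost_per_million_tokens(40e9, 15, 1e9, yield_per_watt)
    print(f"{yield_per_watt} tok/s/W -> ${c:.3f} per 1M tokens")
```

The absolute numbers matter less than the shape: with capex fixed, the only lever left is tokens out per watt in.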
This is also where code-level waste compounds the problem. Individual workloads that are misconfigured, over-provisioned, or simply forgotten don’t trip alarms in isolation. But inside a power-constrained infrastructure where every watt needs to produce tokens, that invisible waste has a direct cost to throughput and to revenue. We’ve written about exactly this dynamic before.
4. Claude Code changed the demand equation — and your engineering org just got more expensive
Huang gave notable credit to Claude Code as one of three inflection points that drove the current explosion in compute demand, alongside ChatGPT’s generative AI breakthrough and o1’s reasoning capability — which he said allowed AI to “reflect” on problems, plan solutions, and decompose what it couldn’t previously solve.
“Claude Code has revolutionized software engineering, as all of you know. 100% of NVIDIA is using a combination of — or oftentimes all 3 of them, Claude Code, Codex and Cursor — all over NVIDIA. There’s not one software engineer today who is not assisted by one or many AI agents helping them code.”
For engineering leaders, this is a signal and a cost driver simultaneously.
Agentic coding tools aren’t just productivity multipliers. They’re token consumers. At scale, a development organization running Claude Code across every engineer generates a meaningful and growing API spend that lives nowhere on the traditional cloud bill.
It shows up in an Anthropic invoice, embedded in a SaaS vendor’s pricing, or quietly multiplied across team-level API keys that nobody centrally tracks.
Huang even suggested that engineering leaders will soon compete for talent partly on the size of the token budget they offer employees. Two years ago that was science fiction. Today it’s a recruiting line item.
If that spend isn’t centrally visible and attributable, it’s already ungoverned.
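Making it governed can start with something as simple as rolling per-key usage up to teams. A minimal sketch, assuming a hypothetical key-to-team mapping and a simplified usage-export shape; in practice this data comes from your AI vendor’s usage export or a cost platform:

```python
# Minimal sketch: attributing API-key token spend to teams. The record
# shape and the key-to-team mapping are hypothetical.

from collections import defaultdict

KEY_TO_TEAM = {"key-alpha": "platform", "key-beta": "growth", "key-gamma": "platform"}

usage = [  # (api_key, model, tokens, usd) -- illustrative rows
    ("key-alpha", "opus",   12_000_000, 420.00),
    ("key-beta",  "sonnet", 30_000_000, 270.00),
    ("key-gamma", "sonnet",  8_000_000,  72.00),
    ("key-delta", "opus",    5_000_000, 175.00),  # untracked key surfaces as unattributed
]

spend = defaultdict(float)
for api_key, _model, _tokens, usd in usage:
    spend[KEY_TO_TEAM.get(api_key, "unattributed")] += usd

for team, usd in sorted(spend.items()):
    print(f"{team:>14}: ${usd:,.2f}")
```

Note what the sketch surfaces for free: the keys nobody mapped show up as an explicit “unattributed” line instead of disappearing into overhead.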
According to CloudZero’s FinOps in the AI Era report, 91% of companies are already embedding AI in their products. The API spend question isn’t theoretical for most engineering organizations. It’s already on next quarter’s P&L.
CloudZero’s Claude Code plugin was built directly for this — giving engineering teams real-time cost visibility into their agentic coding sessions, attributed to the teams and workflows generating them, without requiring anyone to leave their development environment.
5. Shadow AI spend is already happening — you just can’t see it yet
The same dynamic that created shadow IT is playing out with AI tooling.
Employees and teams are spinning up API subscriptions, experimenting with models, and generating token spend without central visibility. Finance and IT leaders are realizing they don’t have a full picture of what’s being spent, who’s responsible, or why.
Huang’s framing of the token budget as a forthcoming standard employee benefit makes this more urgent, not less. If every engineer gets a token allowance as part of their compensation, who’s spending what, and on which models, becomes a core finance and HR governance question, not just a technical one.
Emily Allen’s experience at CloudZero is instructive here. Before the MCP integration, all AI API costs were accumulating on a single credit card. Visibility came from a manually copied spreadsheet.
That’s not an unusual setup. That’s how most organizations handle AI vendor spend today. And when the aggregate inference bill lands, it produces at company scale exactly the cost surprises Huang is describing at planetary scale.
The fix isn’t complicated. It requires the same thing that solved the shadow IT problem a decade ago: granular attribution before the spend becomes unmanageable, not after. Cloud cost anomaly detection was the answer then. Visibility into AI API spend by team, model, and workload is the answer now, but the scope of what needs to be visible is about to expand dramatically.
6. Agentic systems are about to reshape your entire IT cost structure
The most forward-looking portion of the keynote wasn’t about chips. It was about OpenClaw, an open-source agentic framework that Huang compared, repeatedly, to Linux and HTML in terms of its potential industry impact.
The framing matters for cost leaders: agents access sensitive information, execute code, and communicate externally. They consume tokens continuously, operate autonomously, and in Huang’s vision, will sit at the center of every enterprise IT stack within a few years. Not as assistants. As workers.
“Every single SaaS company will be becoming a [GaaS] company and Agentic-as-a-Service company.”
Every agentic workflow generates token consumption that needs to be attributed somewhere: querying a database, calling an API, generating a report, sending a notification. To the team that configured the agent? The product it served? The business unit it acted on behalf of? These questions don’t have good answers yet, and the frameworks FinOps practitioners built for tagging EC2 instances don’t map cleanly onto autonomous multi-step workflows.
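One way to start closing that gap is to make every token-consuming agent step carry an attribution context from the moment it runs. A sketch under assumed names and structure; this is an illustration, not an established framework:

```python
# Illustrative sketch: an agent cost context that attributes every
# token-consuming step to a team, product, and business unit.

from dataclasses import dataclass

@dataclass
class CostContext:
    team: str
    product: str
    business_unit: str
    total_tokens: int = 0

    def record(self, step: str, tokens: int) -> None:
        """Attribute one agent action: a query, API call, report, notification."""
        self.total_tokens += tokens
        print(f"[{self.team}/{self.product}/{self.business_unit}] "
              f"{step}: {tokens:,} tokens")

ctx = CostContext(team="data-eng", product="weekly-report-agent",
                  business_unit="finance")
ctx.record("query database", 1_800)
ctx.record("generate report", 6_500)
ctx.record("send notification", 300)
print(f"total attributed: {ctx.total_tokens:,} tokens")
```

The design choice is the point: attribution is decided when the agent is configured, not reverse-engineered from a bill months later.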
According to CloudZero’s FinOps in the AI Era report, FinOps maturity doubled year over year, but efficiency scores collapsed at the same time. Organizations got better at the process while losing ground on the outcomes. Agentic systems are about to make that gap wider, not narrower, for teams that don’t proactively build cost attribution into their agent architecture from the start.
7. Cost attribution for shared AI infrastructure is a governance gap waiting to explode
Agentic workflows are one layer of the attribution problem. The infrastructure underneath them is another.
Huang’s description of AI factory economics makes one thing clear: shared infrastructure is the norm, not the exception. GPU clusters serve multiple teams, multiple models, and multiple workloads simultaneously. Traditional chargeback and showback approaches weren’t built for that.
For most organizations today, shared AI infrastructure means no one clearly “owns” the bill. It gets absorbed as overhead, split evenly by headcount, or attributed to the team that provisioned the cluster regardless of who’s actually consuming it.
None of those approaches hold up when the bill grows fast.
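A small sketch makes the difference concrete: the same shared-cluster bill allocated by headcount versus by measured GPU-hours. All numbers are hypothetical:

```python
# Hypothetical shared GPU cluster: same bill, two allocation methods.

CLUSTER_BILL = 500_000.00  # monthly shared-cluster cost (illustrative)

gpu_hours = {"search": 6_000, "recs": 3_000, "experiments": 1_000}
headcount = {"search": 20, "recs": 20, "experiments": 20}

total_hours = sum(gpu_hours.values())
total_heads = sum(headcount.values())

for team in gpu_hours:
    by_usage = CLUSTER_BILL * gpu_hours[team] / total_hours
    by_heads = CLUSTER_BILL * headcount[team] / total_heads
    print(f"{team:>12}: measured ${by_usage:>10,.0f} | even split ${by_heads:>10,.0f}")
```

In this toy example, the even split charges the lightest consumer more than three times what it actually used. That distortion is tolerable at $50,000 a month and untenable at $5 million.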
Huang was direct about what’s at stake: “No matter what happens, you still have to build a gigawatt data center. You better make for darn sure you put the best computer system on that thing so that you could have the best token cost.”
The same logic applies at any scale. You’re committed to the infrastructure cost whether or not you can justify what’s running on it.
The principle is the same one that makes cloud cost allocation valuable for traditional cloud: tying spend back to the business contexts consuming it. It applies directly to shared AI infrastructure. And that architecture needs to be in place before the spend is unmanageable — not after the CFO puts it on the QBR agenda.
8. Multi-cloud and multi-vendor AI complexity is compounding fast
Huang touched on the expansion of AI infrastructure into sovereign deployments, regional clouds, on-prem installations, and space-based data centers. The practical upshot for enterprise architecture: the multi-vendor, multi-cloud reality is not simplifying. It’s deepening.
“Whether it’s a data center, cloud, on-prem, at the edge, on a robotic system — all of those computing systems are different,” Huang said. “AI could be deployed literally everywhere.”
When AI workloads run across AWS, Azure, GCP, private cloud, and edge nodes simultaneously, each with different pricing models, attribution mechanisms, and contractual structures, consolidated cost visibility becomes genuinely hard. Negotiating leverage is fragmented. Benchmarking efficiency across environments is inconsistent.
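The first engineering step toward consolidated visibility is normalization: mapping each provider’s billing shape onto one schema so spend is comparable at all. A deliberately simplified sketch; the field names are stand-ins for real billing exports, which are far more complex in practice:

```python
# Simplified sketch of multi-cloud spend normalization. Field names are
# stand-ins for real billing exports (AWS CUR, Azure cost exports, GCP
# billing export), each of which is far richer than shown here.

def normalize(provider: str, raw: dict) -> dict:
    """Map a provider-specific billing row onto one shared schema."""
    if provider == "aws":
        return {"provider": "aws", "usd": raw["unblended_cost"],
                "sku": raw["usage_type"]}
    if provider == "azure":
        return {"provider": "azure", "usd": raw["cost_in_billing_currency"],
                "sku": raw["meter_name"]}
    if provider == "gcp":
        return {"provider": "gcp", "usd": raw["cost"],
                "sku": raw["sku_description"]}
    raise ValueError(f"unknown provider: {provider}")

rows = [
    normalize("aws",   {"unblended_cost": 1240.50, "usage_type": "p5.48xlarge"}),
    normalize("azure", {"cost_in_billing_currency": 980.00, "meter_name": "ND H100 v5"}),
    normalize("gcp",   {"cost": 610.25, "sku_description": "A3 Instance Core"}),
]
print(f"consolidated GPU spend: ${sum(r['usd'] for r in rows):,.2f}")
```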
CloudZero’s Cloud Economics Pulse data shows that organizations are increasingly committing AI budgets before the infrastructure decisions are fully resolved. They’re locking in spend against a multi-vendor architecture they don’t yet have consolidated visibility into.
CFOs keep escalating this exact problem. It doesn’t get better without a single view of spend across all environments — and the multi-cloud picture is only getting more complicated.
9. The forecasting problem is structural — and it’s about to get worse
Every point Huang made ultimately resolves to the same underlying challenge: AI infrastructure costs are variable, architecturally complex, and tied to usage patterns that are still evolving. That makes them nearly impossible to forecast with traditional methods.
Huang put the demand signal bluntly. “Right here where I stand, I see through 2027, at least $1 trillion,” he told the crowd. “I am certain computing demand will be much higher than that.”
For individual enterprises, the equivalent signal is clear: AI spend trajectories are unlikely to flatten, and the variance around any forecast is wide.
Finance leaders need a model that accounts for inference variability, model tier selection, agentic workload scaling, and multi-vendor commitments simultaneously. That’s not a spreadsheet problem.
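To see why, consider even a toy Monte Carlo forecast: wide, compounding month-over-month variance produces a spend range no single-point estimate captures. Every parameter below is an assumption; the point is the width of the range, not the numbers:

```python
# Toy Monte Carlo forecast of annual AI spend under uncertain growth.

import random

def simulate_annual_spend(monthly_base: float, growth_mu: float,
                          growth_sigma: float) -> float:
    """Compound 12 months of uncertain month-over-month growth."""
    total, monthly = 0.0, monthly_base
    for _ in range(12):
        monthly *= 1 + random.gauss(growth_mu, growth_sigma)
        total += monthly
    return total

random.seed(0)
runs = sorted(simulate_annual_spend(100_000, growth_mu=0.10, growth_sigma=0.15)
              for _ in range(10_000))
p10, p50, p90 = runs[1_000], runs[5_000], runs[9_000]
print(f"annual AI spend: p10 ${p10:,.0f} | p50 ${p50:,.0f} | p90 ${p90:,.0f}")
```

A single budget number hides that p10-to-p90 spread. Granular attribution is what narrows it, because you can only forecast the components you can see.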
And it’s not just the big infrastructure bets that erode forecasting accuracy. Misconfigured jobs, forgotten workloads, and quietly recurring charges aggregate until a quarterly review reveals a number nobody can explain. That pattern scales from a $2,000 CronJob to a seven-figure AI infrastructure line item. The only difference is visibility.
The organizations that will forecast AI costs accurately are the ones building granular attribution now, before the scale makes it unmanageable.
The bottom line
Jensen Huang spent two hours describing the hardware. Most headlines will focus there too. The more important story for enterprise leaders is the economic model underneath it — and whether their organizations are ready for it.
Tokens are the new unit of AI output. Inference is the dominant cost driver. Agents are about to multiply token consumption across every function in your organization. And the cost management frameworks built for traditional cloud were not designed for any of this.
The organizations that win the next five years won’t just be the ones that deploy AI fastest. They’ll be the ones that can answer, with specificity, whether it was worth it: cost per inference call, cost per agent task, token spend per engineering team, AI margin by product line, cost per anything. That’s not a future problem. It’s already the problem.
Want to understand what your AI infrastructure actually costs and whether it’s paying off? Read the FinOps in the AI Era report.
Ready to get visibility into your AI and cloud spend? Schedule a demo.


