

The economics of AI inference will define the next decade of software

Most public conversations about AI in 2026 are still about capability. What can the new model do? How does it score on this benchmark? Which lab is ahead this quarter?

The conversations that actually matter inside companies running production AI are different. They’re about economics. Can we afford to do this at our traffic volume? What does our unit cost look like at 10x? Which features are profitable to ship and which ones quietly aren’t?

There’s a gap between those two conversations, and it’s getting wider. The public discourse is fixated on the capability frontier; the operational reality is fixated on the cost frontier. And as is usually the case in technology, the operational reality is where the actual industry shape is being decided.

Every previous platform shift went through the same evolution. Cloud computing started as “can we run this somewhere besides our own datacenter?” and became “can we run this profitably at scale?” Mobile started as “can we put a usable app on a phone?” and became “can we ship features without burning the device’s battery and bandwidth?” In each case, the companies that solved the second question early ate the market created by the first.

AI is at the same inflection point now. The cost frontier is moving from a peripheral concern of infrastructure teams to a central concern of product and engineering leadership. And most teams are not yet organized to operate at that frontier.

The thesis I want to argue here is simple: inference economics is now an architectural problem, not just an infrastructure problem, and most teams haven’t reorganized to reflect that. The implication is that the next decade of software will be shaped less by which model wins benchmarks and more by which teams figure out the unit economics first.

Why the conventional framing falls short

The standard playbook for AI cost optimization is infrastructure-led. Cheaper GPUs. Smaller model variants. Quantization. Better batching. Improved caching at the inference layer. All of these are useful. All of them are necessary at scale. None of them, alone or together, deliver the kind of order-of-magnitude wins that change a product’s economics.

The reason is structural. The dominant cost driver in production AI workloads is rarely the per-token cost of any single inference call. It’s the number and shape of inference calls the system makes — and that’s determined by architecture, not by infrastructure.

A few framings make this concrete.

The decision of which model to call is more economically consequential than how cheaply you serve any given model. Routing 70% of requests to a model an order of magnitude cheaper than your default produces savings that no quantization scheme can match.

The decision of whether to call a model at all — versus consulting a cache, applying a heuristic, or using a deterministic fallback — is more consequential still. A well-designed system avoids the inference call entirely for a meaningful fraction of its traffic.

The decision of how to compose model calls is more consequential than either. Replacing a single large-model call with a routed pipeline of smaller specialized calls routinely cuts cost by 5-10x at equivalent task quality.

Infrastructure optimization plays for 10-30% wins. Architectural optimization plays for 5-10x wins. The two are not comparable in magnitude, and treating them as the same kind of problem — both filed under “AI cost optimization” — is precisely why so many cost initiatives plateau after the easy infrastructure wins are taken.
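To put rough numbers on the routing framing above, here is a minimal arithmetic sketch. The function name `blended_cost_per_request` and the normalized default cost of 1.0 are illustrative assumptions, not measurements; the 70% share and the 10x price gap are the figures from the routing example.

```python
def blended_cost_per_request(tiers):
    """Average cost per request across tiers.

    tiers: list of (traffic_share, cost_per_request) pairs whose
    shares are expected to sum to 1.
    """
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in tiers)


# Assumption: costs are normalized so the default model costs 1.0 per request.
DEFAULT_COST = 1.0

routed = blended_cost_per_request([
    (0.7, DEFAULT_COST / 10),  # 70% of traffic on a model 10x cheaper
    (0.3, DEFAULT_COST),       # the remaining 30% stays on the default
])
savings = 1 - routed / DEFAULT_COST

print(f"blended cost: {routed:.2f}, savings: {savings:.0%}")
```

Under these assumptions the blended cost comes out at about 0.37 of the default, a saving of roughly 63%, or about 2.7x. In this toy arithmetic the levers multiply: a 50% cache hit rate stacked on top would bring the blended cost to about 0.19, or roughly 5.4x.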

Five architectural shifts that actually move the needle

The teams I see consistently shipping cost-efficient AI tend to operate from a small set of principles. None of them are individually novel; the discipline is in applying them together, deliberately, as architectural commitments rather than as occasional optimizations.

1. Right-size before you optimize

The fastest cost win is almost always model selection — not infrastructure tuning. A smaller model that’s good enough for the task beats an over-provisioned large one every time, and most teams discover this much later than they should.

The reason teams default to oversized models is rational in isolation: when you’re shipping a feature for the first time, you want maximum capability headroom. The largest, most capable model is the safest engineering choice in the moment. The problem is that the choice rarely gets revisited. The feature ships, traffic grows, and the team is now paying premium-tier inference costs for workloads that a model an order of magnitude cheaper could have handled fine.

The discipline that helps here is treating model selection as a measured decision, not a default one. For each meaningful workload, run an honest evaluation against two or three smaller candidate models. The result is often surprising: a substantial fraction of production traffic runs perfectly well on cheaper models. The teams that internalize this early are running a fundamentally different cost structure than the ones that don’t.
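One way to make that evaluation routine is a small harness that scores each candidate on a labeled sample and picks the cheapest one that clears a quality bar. This is a sketch under stated assumptions: the `Candidate` type, the toy sentiment "models," the per-request costs, and the 0.9 accuracy threshold are all hypothetical stand-ins for real model calls and a real evaluation set.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_request: float          # hypothetical USD figure
    predict: Callable[[str], str]    # stand-in for a real model call


def accuracy(candidate: Candidate, examples: Sequence[tuple]) -> float:
    """Fraction of (text, expected_label) examples labeled correctly."""
    correct = sum(candidate.predict(text) == label for text, label in examples)
    return correct / len(examples)


def cheapest_good_enough(candidates, examples, min_accuracy) -> Optional[Candidate]:
    """Cheapest candidate meeting the quality bar, or None if none does."""
    passing = [c for c in candidates if accuracy(c, examples) >= min_accuracy]
    return min(passing, key=lambda c: c.cost_per_request, default=None)


# Toy stand-ins for models of increasing capability.
def keyword_model(text: str) -> str:
    return "pos" if {"great", "love"} & set(text.split()) else "neg"


def negation_aware_model(text: str) -> str:
    return "neg" if "not" in text.split() else keyword_model(text)


EXAMPLES = [
    ("great product", "pos"),
    ("terrible support", "neg"),
    ("love it", "pos"),
    ("not great", "neg"),
]

CANDIDATES = [
    Candidate("tiny", 0.0003, keyword_model),
    Candidate("small", 0.003, negation_aware_model),
    Candidate("large", 0.03, negation_aware_model),
]

choice = cheapest_good_enough(CANDIDATES, EXAMPLES, min_accuracy=0.9)
print(choice.name if choice else "no candidate meets the bar")
```

In this toy setup the tiny model misses the negated example and falls below the bar, so the harness settles on the small one rather than the large one. The useful part is the shape of the decision: quality is measured against a threshold first, and cost breaks the tie.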

This is what people are starting to call the small-model renaissance — the recognition that capability headroom is a luxury most workloads don’t actually need, and that paying for it indiscriminately is a structural disadvantage.

2. Route, don’t run

A lightweight classifier deciding which model handles a request beats a single large model handling everything. This pattern alone often delivers more savings than any single infrastructure optimization, and it scales gracefully as the underlying model landscape evolves.

The architecture is straightforward: a small, fast classifier (sometimes itself a tiny model, sometimes pure heuristics) examines each incoming request and routes it to the appropriate downstream model. Simple requests go to cheap models. Complex requests go to capable ones. Edge cases go to specialized ones. The classifier itself costs almost nothing to run at scale, and the savings on the routed traffic compound across every request.
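A minimal sketch of that routing layer, using pure heuristics as the classifier. Everything named here is an assumption for illustration: the tier names, the marker keywords, the 150-word cutoff, and the lambda handlers standing in for real model clients.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical tier names; in practice each maps to a client for a real model.
CHEAP, CAPABLE, SPECIALIZED = "cheap", "capable", "code-specialist"

# Illustrative heuristics only; a production classifier would be tuned on real traffic.
CODE_MARKERS = ("traceback", "stack trace", "segfault")
COMPLEX_MARKERS = ("compare", "analyze", "step by step", "trade-off")


def classify(request: str) -> str:
    """Pick a tier for a request using cheap string checks."""
    text = request.lower()
    if any(marker in text for marker in CODE_MARKERS):
        return SPECIALIZED
    if len(text.split()) > 150 or any(marker in text for marker in COMPLEX_MARKERS):
        return CAPABLE
    return CHEAP


@dataclass
class Router:
    # Maps tier name to a callable; swapping a model means changing one entry.
    handlers: Dict[str, Callable[[str], str]]

    def handle(self, request: str) -> str:
        return self.handlers[classify(request)](request)


router = Router(handlers={
    CHEAP: lambda r: f"[cheap] {r}",
    CAPABLE: lambda r: f"[capable] {r}",
    SPECIALIZED: lambda r: f"[code-specialist] {r}",
})

print(router.handle("What are your opening hours?"))
```

The design choice worth noting is that application code only ever calls `router.handle`. Which model sits behind each tier lives in one mapping, so it can change without touching business logic.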

What I find interesting about this pattern is that it’s quietly becoming the highest-leverage layer in the stack, the role the load balancer and the API gateway played before it. Twenty years ago, the highest-leverage piece of architecture in a web application was the load balancer in front of your stateless app servers. Ten years ago, it was the API gateway in front of your microservices. In AI systems, it’s increasingly the routing layer in front of your model calls. The piece that decides what to call matters more than the piece that does the calling.

This pattern also future-proofs your system in a way that infrastructure optimization doesn’t. As new models arrive — cheaper, faster, more specialized — a routing layer lets you adopt them surgically, swapping them in for specific traffic slices without rewriting your application code. Teams without a routing layer end up with model choices baked into business logic, and changing them is painful enough that they often don’t.

3. Cache as a first-class system

Production AI traffic has staggering redundancy. The same questions, asked in slightly different words, by different users, in different contexts, often resolve to the same answer. The same documents get summarized repeatedly. The same workflows get triggered with overlapping inputs. Once you instrument honestly for this, the redundancy is hard to unsee.

Most teams under-invest in caching for AI workloads. The reflex from traditional software is that cache invalidation is hard, that AI outputs are non-deterministic, that semantic similarity is fuzzy. All true, and none of it is a reason to skip caching — only a reason to treat it as a real system rather than a quick optimization.

A serious AI caching layer is closer in spirit to a search index than to a Redis instance. It uses embeddings to identify semantically similar requests. It has clear policies about what’s cacheable and what isn’t. It tracks hit rates as a first-class metric. It includes mechanisms for staleness, invalidation, and eviction that are tuned to the specific characteristics of AI traffic rather than borrowed wholesale from web caching.
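The pieces listed above (similarity lookup, a staleness policy, hit-rate tracking) can be sketched in a few dozen lines. This is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the lookup is a linear scan where a real system would use a vector index, and the `SemanticCache` name, the 0.8 threshold, and the one-hour TTL are all hypothetical.

```python
import math
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8, ttl_seconds=3600.0, clock=time.monotonic):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so staleness can be tested
        self.entries = []       # list of (embedding, answer, stored_at)
        self.hits = 0
        self.lookups = 0

    def get(self, query: str):
        """Return a cached answer for a similar-enough query, else None."""
        self.lookups += 1
        now = self.clock()
        # Staleness policy: drop entries older than the TTL.
        self.entries = [e for e in self.entries if now - e[2] <= self.ttl_seconds]
        q = embed(query)
        best_score, best_answer = 0.0, None
        for embedding, answer, _ in self.entries:  # linear scan; fine for a sketch
            score = cosine(q, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            self.hits += 1
            return best_answer
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer, self.clock()))

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the sign-in page.")
print(cache.get("how do i reset my password please"))  # similar wording: served from cache
print(cache.get("what is your refund policy"))         # unrelated: None, would call the model
```

The threshold and TTL are placeholders that need tuning against real traffic: too loose a threshold returns answers to questions nobody asked, and too long a TTL serves stale ones.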

When this is done well, hit rates of 30-50% on production traffic are common. The economic implication is direct: at the top of that range, with half your inference traffic served from cache at near-zero marginal cost, your effective per-request cost is roughly half what it would otherwise be. There is almost no infrastructure optimization that produces an equivalent win.

4. Async by default for non-interactive work

“Real-time” is a constraint we often put on ourselves rather than one our users actually require. A large fraction of AI workflows in production are not, when examined honestly, time-sensitive in the way the architecture treats them. Background summarization. Periodic enrichment. Batch classification. Document processing. On-call assistance. Many of these run as synchronous, latency-optimized workflows because that’s how they were first built — not because anyone benefits from the latency.

Treating these workflows as asynchronous can unlock an order-of-magnitude reduction in cost. Async workflows can batch requests across users, share inference capacity across time, defer expensive operations to off-peak windows, and use lower-priority compute classes. Each of those mechanisms is individually significant; together they reshape the cost curve.
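Of those mechanisms, cross-request batching is the easiest to sketch. The example below uses Python's `asyncio`: callers submit work and await a result, while a single worker waits a short window and sends everything that accumulated as one batch. The names `batch_worker`, `submit`, and `run_batch`, the batch size of 8, and the 50 ms window are assumptions for illustration; `run_batch` stands in for a real batched inference call.

```python
import asyncio


async def batch_worker(queue, run_batch, max_batch=8, window=0.05):
    """Collect queued requests for a short window, then run them as one batch."""
    while True:
        batch = [await queue.get()]      # block until there is work
        await asyncio.sleep(window)      # let more requests accumulate
        while len(batch) < max_batch and not queue.empty():
            batch.append(queue.get_nowait())
        payloads = [payload for payload, _ in batch]
        try:
            results = await run_batch(payloads)
        except Exception as exc:
            # Propagate the failure to every caller waiting on this batch.
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            continue
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)


async def submit(queue, payload):
    """Enqueue one request and wait for its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((payload, future))
    return await future


async def demo():
    batch_sizes = []

    async def run_batch(payloads):
        # Hypothetical stand-in for one batched model call.
        batch_sizes.append(len(payloads))
        return [p.upper() for p in payloads]

    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue, run_batch))
    results = await asyncio.gather(*(submit(queue, f"doc-{i}") for i in range(5)))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, five independent submissions end up in a single batched call. The window is the knob: the longer a workflow can tolerate waiting, the more requests each call amortizes.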

The discipline here is asking, for each workflow, “who actually waits for this?” If the answer is “the user is staring at a loading spinner,” real-time is the right design. If the answer is “the result shows up in a notification, an email, a dashboard, or a workflow downstream,” async is almost certainly cheaper, more reliable, and more scalable.

A surprising number of “real-time” AI features turn out to fall into the second category. Reclassifying them is one of the highest-leverage moves a team can make, and it usually requires no model changes at all.

5. Inference cost as a first-class observability metric

You cannot optimize what you do not measure. Most teams running production AI track latency and error rate as core metrics; far fewer track cost-per-request with the same rigor. The result is a system whose performance is observable in two dimensions and whose economics are observable only in monthly bills.

A first-class cost metric is per-request, attributed to specific workflows or features, and surfaced in the same dashboards engineers look at every day. When a team can see that workflow A costs $0.012 per request and workflow B costs $0.31, the conversation about which one to prioritize for optimization writes itself. When the cost data is buried in a finance dashboard checked once a month, that conversation never happens.
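A minimal sketch of per-request cost attribution, assuming token counts are available for each call. The `CostLedger` class, the workflow names, and especially the prices in `PRICES_PER_MILLION_TOKENS` are hypothetical placeholders, not any vendor's actual pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens: (input, output). Not real vendor pricing.
PRICES_PER_MILLION_TOKENS = {
    "small-model": (0.15, 0.60),
    "large-model": (3.00, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call, from token counts and the price table."""
    input_price, output_price = PRICES_PER_MILLION_TOKENS[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


class CostLedger:
    """Accumulates inference cost per workflow so it can sit next to latency metrics."""

    def __init__(self):
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, workflow: str, model: str, input_tokens: int, output_tokens: int) -> float:
        cost = request_cost(model, input_tokens, output_tokens)
        self._total[workflow] += cost
        self._count[workflow] += 1
        return cost

    def cost_per_request(self) -> dict:
        return {w: self._total[w] / self._count[w] for w in self._total}


ledger = CostLedger()
ledger.record("ticket-triage", "small-model", input_tokens=800, output_tokens=100)
ledger.record("contract-review", "large-model", input_tokens=20_000, output_tokens=2_000)

for workflow, cost in sorted(ledger.cost_per_request().items(), key=lambda kv: -kv[1]):
    print(f"{workflow}: ${cost:.5f} per request")
```

In a real system the `record` call would emit to whatever metrics pipeline already carries latency and error rate, tagged by workflow, so cost shows up on the same dashboards rather than in a separate report.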

The teams that take this seriously develop something like a culture of cost-awareness — not in a penny-pinching way, but in the same way good teams develop a culture of latency-awareness or reliability-awareness. Cost becomes part of how engineers reason about their work, not an externality managed by someone else.

The deeper shift: inference cost is a product decision

The five principles above are useful, and applying them rigorously will materially change a team’s cost structure. But there is a deeper insight underneath them, and it is the part of the argument I want to leave you with.

Inference cost is a product decision before it’s an engineering one.

The decisions that actually determine a system’s inference economics are not infrastructure decisions. They are decisions about what the product does, when it invokes AI, what quality threshold counts as “good enough,” how much latency users will tolerate, and which features are worth what level of capability. These are product decisions. They get made, explicitly or implicitly, by product managers, by founders, and by the defaults of how features are scoped. And they set the floor on the engineering team’s costs far more than any architectural choice the engineering team makes downstream.

The pattern I keep seeing is that organizations under-resource the product side of inference economics. Engineers heroically optimize the architecture, the routing, the caching, the model selection, all to compensate for product decisions that should have been made earlier and differently. The team that ships a feature requiring real-time large-model calls for every user interaction has set a cost floor for itself that no amount of infrastructure work can fully undo.

The companies I see winning at AI economics tend to have someone who explicitly owns inference unit economics. Sometimes a product manager. Sometimes a tech lead. Sometimes an architect with a product mandate. The role’s title varies; the function is the same. They sit in the conversations where features are scoped and ask the questions that determine the system’s eventual cost shape. Do users actually need this in real-time? What quality threshold is the minimum acceptable result? Is there a workflow design that requires fewer or cheaper model calls?

The companies I see struggling tend to have everyone slightly responsible for cost and no one fully responsible. Engineering thinks product is making the architectural decisions. Product thinks engineering is handling cost. Finance sees the bill at the end of the month. The decisions that determine the cost structure get made, but they don’t get owned — and the absence of ownership shows up directly in the burn rate.

This is the part of the argument that engineering leaders, in particular, should sit with. Optimizing inference costs after a feature ships is a fraction as effective as shaping the feature’s design before it ships. If your AI cost initiative is mostly architectural retrofitting, you are working on the second-order problem. The first-order problem is who is in the room when the feature is scoped, and what questions they are equipped to ask.

Why this matters for the next decade

Step back and look at where the AI infrastructure layer is heading. The trajectory is clear: many companies will offer cheaper inference, faster inference, more specialized inference. The compute layer will get crowded and competitive. Margins on raw inference are already compressing and will compress further. That trend is inevitable, and it’s the easy part of the problem.

The hard part is system design that exploits inference economics intelligently. That isn’t a problem any vendor can solve for you. It’s an architecture-and-organization problem, owned by the engineering and product leaders building the application. And the gap between teams that solve it well and teams that don’t is going to widen significantly over the next several years.

This has direct implications for what kinds of engineering leaders will be disproportionately valuable in this period. The leaders who develop deep intuition for the interplay between model selection, system architecture, product scoping, and operational cost — who can hold all four in their head at once and reason fluently across them — are going to be the ones who can credibly drive AI strategy at companies serious about shipping AI profitably. That intuition does not develop by accident, and it does not develop by reading about the capability frontier. It develops by sitting close to the work, instrumenting honestly, and treating economics as a first-class concern of system design.

Capability will continue to matter. The model frontier will continue to advance, and there will continue to be reasons to use the most capable model for specific high-value workloads. But for the broad middle of production AI traffic — which is where most of the dollars in this industry will be spent — the question is not which model is most capable. The question is which architectures, which organizations, and which leadership instincts produce systems that are economically sustainable at scale.

Closing

Most public AI commentary will keep being about capability for some time. That’s where the headlines are. But the conversations that determine industry shape are happening elsewhere, in product and engineering meetings where teams are deciding which features to ship, how to architect them, and who owns the resulting cost.

The teams that treat inference economics as an architectural problem, owned by someone with both product and engineering authority, are going to look like magicians for the next several years. They’ll ship more features at lower cost, weather model price changes more gracefully, and have more strategic flexibility than competitors who treated cost as an afterthought. The teams that don’t will be quietly capacity-constrained by their own architecture, wondering why their AI bills keep climbing faster than their revenue.

If you’re working on this, I’d be curious to hear where the hardest tradeoffs are showing up in practice. What’s the inference economics decision your team is wrestling with right now? Where is the gap between product scoping and engineering execution showing up most painfully?

You can find more of what I’m working on at /now, or reach out directly through /contact.