milebits
10 min read · agents · cost · operations · production

The real operational cost of AI agents

Token bills are the visible part of the cost. The bigger numbers are hidden in retries, fallbacks, conversation context growth, and cost accounting nobody set up. Cost discipline is an architecture decision, not an optimisation.

The conversation starts the same way every time. A team has shipped an agent. It works. They are proud of it. Six weeks later, somebody on the finance side notices a model bill that used to be $400 a month is now $5,200 a month, and the growth curve does not look like it is plateauing. Engineering opens the model provider dashboard, scrolls through the calls, and cannot tell which feature is responsible for which fraction of the bill.

The first reaction is usually to blame token costs. They are rarely the whole problem. They are the visible part of the problem. The actual operational cost of an AI agent is a stack of things, most of which the team has not been measuring.

If you are building agents now, or you have already shipped one and the bill is starting to feel uncomfortable, this is the shape of the failure, in the order we would address it.

   ┌─ Visible bill ─────────────────────────────┐
   │  model calls                               │
   └────────────────────────────────────────────┘
                       │
                       ▼  what most teams measure

   ┌─ Hidden multipliers ───────────────────────┐
   │  retries · fallbacks · long context        │
   │  eval runs · embedding refreshes ·         │
   │  runaway loops                             │
   └────────────────────────────────────────────┘
                       │
                       ▼  what actually drives the bill

   ┌─ Control layer ────────────────────────────┐
   │  cost labels · budgets · tier routing      │
   │  caching · anomaly alerts                  │
   └────────────────────────────────────────────┘

Token spend is the visible layer. The hidden multipliers are what blow up the bill. The control layer is what keeps the system economically predictable.

The model bill is not the whole bill

Token costs are real. They are also bounded. A well-architected agent serving a thousand conversations a day on a $3-per-million-tokens model lands in the low hundreds of dollars per day. That number is uncomfortable but not catastrophic.

The numbers that hurt are the ones nobody is watching:

Retries on tool failures. Your agent calls a tool, the tool times out or returns a malformed response, the framework retries. Each retry is another full model call to decide what to do next. We have seen agents that retry up to five times by default on transient failures. If a flaky tool is hit thousands of times a day, retry tokens alone can become a meaningful daily cost line, separate from the primary work.

Model fallbacks. The team set up a fallback: if the cheap model fails, escalate to the expensive one. Then one day the cheap model started returning malformed JSON ten percent of the time after a provider update. Now every tenth request is being silently re-run on a model that is fifteen times the price. Nobody noticed because the latency was acceptable and the responses still looked right. The bill noticed.

Embedding refreshes. The team set up a nightly job to re-embed the knowledge base. That job ran without anyone looking at it for three months. The corpus grew. The model used got more expensive. The total nightly embedding cost crept up to $30 a night and is currently the third-largest line item.

Evaluation runs. The eval harness, which is good, has been running on every PR plus nightly plus a weekly deep run. That is a real bill. We have seen evals consume forty percent of the total model spend on a small team, mostly because nobody had thought to optimise eval runs the way they optimised production runs.

These costs are not bugs. They are the cost of running an AI system in production. The problem is not that they exist. The problem is that they are not labelled. The team cannot tell you what fraction of the monthly bill comes from each one. So when the bill jumps, the team cannot tell what changed.
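The fallback example above is worth working through as arithmetic. This is a minimal sketch, assuming the failed cheap call is still billed and the premium call costs a fixed multiple of it; the function name is ours, and the inputs are the ten percent and fifteen-times figures from the example.

```python
def effective_cost_multiplier(fallback_rate: float, premium_price_ratio: float) -> float:
    """Expected cost per request, relative to one cheap-model call.

    Every request pays for the cheap call. A fraction `fallback_rate`
    additionally pays for a premium call that costs `premium_price_ratio`
    times as much as the cheap one.
    """
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return 1.0 + fallback_rate * premium_price_ratio


# Figures from the example: 10% of requests fall back to a model 15x the price.
print(effective_cost_multiplier(0.10, 15.0))  # 2.5
```

Under those assumptions a fallback that fires on one request in ten multiplies the cost of that feature by two and a half, with no change in traffic.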

Cost attribution is the first piece of infrastructure

Before you can manage cost, you have to be able to see it.

Every model call that goes through your system should be tagged with: the feature it served, the tenant or customer it served, the user-visible flow it was part of, the type of operation (primary generation, retry, fallback, eval, background job). This information has to land in your logs. It has to be queryable.

The standard pattern is wrapping the provider SDK with a thin layer that adds these labels to every request. Helicone and Langfuse both do this if you do not want to write the wrapper yourself. Once it is in place, the question "what cost us $1,200 last week" has a real answer instead of a guess. The team can look at the dashboard and see that customer X with their high-volume use case is responsible for $400, the eval harness is $300, and a flaky tool is responsible for $200 in retries.
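A hand-rolled version of that wrapper can be very small. The sketch below is illustrative only: `LabelledClient`, `CostLabels`, `Usage` and `fake_model` are hypothetical names, the model call is a stub rather than a real provider SDK, and the $15 output price is an assumption. It is not how Helicone or Langfuse implement this.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CostLabels:
    feature: str
    tenant: str
    flow: str
    operation: str  # "primary", "retry", "fallback", "eval" or "background"


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


class LabelledClient:
    """Wraps a model-calling function and records labelled cost per call."""

    def __init__(
        self,
        call_model: Callable[[str], Tuple[str, Usage]],
        usd_per_million_input: float,
        usd_per_million_output: float,
    ) -> None:
        self._call_model = call_model
        self._in_price = usd_per_million_input
        self._out_price = usd_per_million_output
        self.records: List[dict] = []  # in production, emit to your log pipeline

    def call(self, prompt: str, labels: CostLabels) -> str:
        text, usage = self._call_model(prompt)
        usd = (
            usage.input_tokens * self._in_price
            + usage.output_tokens * self._out_price
        ) / 1_000_000
        self.records.append({
            "feature": labels.feature,
            "tenant": labels.tenant,
            "flow": labels.flow,
            "operation": labels.operation,
            "usd": usd,
        })
        return text

    def cost_by(self, field: str) -> Dict[str, float]:
        """Total recorded spend grouped by one label, e.g. "tenant"."""
        totals: Dict[str, float] = defaultdict(float)
        for record in self.records:
            totals[record[field]] += record["usd"]
        return dict(totals)


def fake_model(prompt: str) -> Tuple[str, Usage]:
    # Stand-in for a real provider call; returns fixed token counts.
    return "ok", Usage(input_tokens=1000, output_tokens=500)


client = LabelledClient(fake_model, usd_per_million_input=3.0, usd_per_million_output=15.0)
client.call("hi", CostLabels("search", "customer-x", "support-chat", "primary"))
client.call("hi", CostLabels("search", "customer-x", "support-chat", "retry"))
print(client.cost_by("operation"))
```

The design choice that matters is that labels are a required argument: a call site that has not decided what it is for cannot make a model call.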

You cannot triage what you cannot see. Cost attribution is the seeing layer. Without it, every cost conversation becomes an argument from memory and belief.

Build this first. Not the agent. Not the prompts. The cost attribution. We usually put the labels in place during week one, alongside the eval harness, then tune budgets, tiering, caching, and alarms in week two or three as the system starts running real load. It is too easy to defer and too important to skip.

The conversation-length problem

Agents that hold a conversation have a specific cost shape that catches teams off-guard.

Every turn in a conversation includes the prior turns in its context, so each turn carries more context than the last. The first message is roughly N tokens. The fifth message is roughly 5N. The tenth message is roughly 10N. The per-turn token cost grows linearly with the number of turns, which means the cumulative token spend across the conversation grows quadratically.

This is fine for short conversations. It is brutal for the ones that do not end.
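To make the growth concrete, here is a small sketch, assuming every turn adds about the same number of tokens and the full history is resent each time with no caching or summarisation. The 500-token figure is illustrative, not a measurement.

```python
def cumulative_context_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a conversation.

    Turn k resends everything so far, roughly k * tokens_per_turn,
    so the total is tokens_per_turn * turns * (turns + 1) / 2.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


for turns in (10, 20, 40):
    print(turns, cumulative_context_tokens(turns, 500))
# 10 27500
# 20 105000
# 40 410000
```

Doubling the number of turns roughly quadruples the spend, which is why the long tail matters more than the average.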

A few patterns help:

Summarisation at fixed points. After every six or eight turns, the agent generates a short summary of the conversation so far and the next turn starts fresh with the summary instead of the full history. Done well, this looks the same to the user and cuts the token spend on long conversations by half or more.

Tool-call elision. The history sent to the model does not need to include the full tool-call inputs and outputs. The fact that "the agent called the lookup tool and got customer ID 47291" can be reduced to a single line by the time the third turn rolls around. Teams that send the raw tool transcripts on every turn are paying for context they do not need.

Per-conversation token budgets. If a conversation goes past N tokens, the system summarises aggressively or hands off to a human. This is the same logic as a token-per-request budget, applied to the whole conversation.
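The first two patterns can be combined in one compaction step that runs before each model call. This is a rough sketch under stated assumptions: `Message`, `elide_tool_output` and `compact_history` are hypothetical names, the trigger is a simple message count rather than a turn count, and `summarise` stands in for whatever cheap model call you would use to write the summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str  # "user", "assistant", "tool" or "summary"
    content: str


def elide_tool_output(msg: Message, max_chars: int = 80) -> Message:
    """Shorten a long raw tool transcript to one truncated line."""
    if msg.role != "tool" or len(msg.content) <= max_chars:
        return msg
    first_line = msg.content.splitlines()[0]
    return Message("tool", first_line[:max_chars] + " [truncated]")


def compact_history(
    history: List[Message],
    summarise: Callable[[List[Message]], str],
    max_messages: int = 8,
    keep_recent: int = 2,
) -> List[Message]:
    """Return the history to send with the next model call."""
    split = max(len(history) - keep_recent, 0)
    # Older tool transcripts are shortened; the most recent messages stay intact.
    older = [elide_tool_output(m) for m in history[:split]]
    recent = history[split:]
    if len(history) <= max_messages:
        return older + recent
    # Past the threshold, everything except the recent messages becomes a summary.
    return [Message("summary", summarise(older))] + recent


def stub_summarise(messages: List[Message]) -> str:
    # Placeholder for a cheap model call that writes the summary.
    return f"{len(messages)} earlier messages"


long_chat = [Message("user", f"message {i}") for i in range(10)]
print([m.role for m in compact_history(long_chat, stub_summarise)])
```

Keeping the last couple of messages verbatim is a judgment call; it avoids the model losing the thread of what the user just asked.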

The point is not to optimise every conversation. The point is to know what your conversations actually cost on average, what the long tail costs, and where the threshold lies beyond which it is no longer worth keeping the model in the conversation.

Model tiering is real architecture

Most teams start with one model for everything. That is understandable early. It gets expensive once volume arrives. The work an agent does is not uniform in difficulty, and the cost of using a single expensive model for all of it adds up fast.

The pattern we default to has three tiers.

Cheap and fast for the high-volume work that does not require deep reasoning. Classification of what kind of question this is. Lightweight extraction. Confirmations. Routing decisions. A model that costs a tenth of the premium tier and runs in 300ms is appropriate for these. They are often sixty percent of the agent's call volume, and that sixty percent should be running on the cheap model.

Medium tier for most generation. Most actual answers, most tool selection, most user-facing synthesis. The middle of the road is usually fine. The premium tier is overkill.

Premium tier for the hard work. Multi-step reasoning. Generation that requires precise citation. Tasks where the failure mode is the model making something up. The premium model exists for these cases and should be used for these cases only.

The trick is the routing layer. The classifier that decides which tier to use is itself a small model call. It has to be cheap enough that the savings from tiering exceed the cost of routing. We have seen teams build elaborate routing layers that cost more than just using the premium model for everything would have. Measure first.
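"Measure first" can start as a back-of-the-envelope comparison. The sketch below is illustrative: costs are expressed relative to one premium call, the cheap tier at a tenth of premium and the sixty percent cheap share come from the tiers described above, and the medium price, the remaining shares and the router cost are assumptions you would replace with your own measurements.

```python
from typing import Dict


def cost_per_request_with_routing(
    router_cost: float, tier_share: Dict[str, float], tier_cost: Dict[str, float]
) -> float:
    """Expected cost per request: one router call plus the share-weighted tier cost."""
    if abs(sum(tier_share.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return router_cost + sum(
        share * tier_cost[tier] for tier, share in tier_share.items()
    )


# Costs are relative to one premium call (premium = 1.0).
TIER_COST = {"cheap": 0.1, "medium": 0.4, "premium": 1.0}
TIER_SHARE = {"cheap": 0.6, "medium": 0.25, "premium": 0.15}

routed = cost_per_request_with_routing(0.02, TIER_SHARE, TIER_COST)
all_premium = TIER_COST["premium"]
print(f"routed: {routed:.2f}, all premium: {all_premium:.2f}")
print("routing pays off" if routed < all_premium else "routing costs more than it saves")
```

The same function shows the failure case: raise `router_cost` far enough, for example by routing with a large model or several calls, and the routed figure overtakes the single-model one.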

Caching is the other half of model tiering. Cached responses avoid another model call entirely. The pattern is to identify the requests that are likely to repeat (identical or near-identical queries, common embeddings, common tool calls), cache them at the response layer with an appropriate TTL, and only call the model on cache miss. For some agents, cache hit rate is one of the biggest available cost levers. We have seen RAG systems where a meaningful fraction of queries had been answered before and the team was paying full price for each one.
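A response-layer cache with a TTL does not need much code to prototype. This is a minimal in-memory sketch, assuming exact matches after whitespace and case normalisation; `ResponseCache` is a hypothetical name, near-identical queries would need something smarter than a hash, and a production version would live in a shared store rather than a dict.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class ResponseCache:
    """Caches model responses by normalised prompt, with a time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse whitespace and case so trivially different prompts share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, call_model: Callable[[str], str]) -> str:
        key = self._key(prompt)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: pay for the model call and store the result.
        self.misses += 1
        response = call_model(prompt)
        self._store[key] = (now, response)
        return response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = ResponseCache(ttl_seconds=300)
cache.get_or_call("What is your refund policy?", lambda p: "stub answer")
cache.get_or_call("what is your  refund policy?", lambda p: "stub answer")
print(cache.hit_rate)
```

Tracking `hit_rate` from day one is the useful part: it tells you whether caching is worth more investment before you build anything more elaborate.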

The runaway loop at 2am

One of the most expensive failure modes in production agents is the runaway loop. The agent gets stuck in a tool-call cycle and keeps calling the model until something stops it. If nothing stops it for six hours, you wake up to a five-figure bill from a single conversation.

The prevention is operational, not algorithmic.

Per-conversation token budgets. Set the budget. Enforce it server-side. When the budget is hit, the conversation terminates with a fallback response, the user gets routed to a human, and the incident is logged. Do not rely on the model to know when to stop.

Maximum tool-call depth. A single conversation should not be able to invoke more than N tool calls without human intervention. The right N depends on the task. For most agents it is somewhere between five and fifteen. Higher than that and you have either an unusual workflow or a loop.

Cost alarms tied to your Slack. Per-tenant cost crossing a threshold should ping somebody within an hour, not surface in the monthly bill review. The threshold should be set based on what is normal for that tenant. Anomaly detection here is genuinely useful and is the kind of monitoring most teams skip because it feels like infrastructure work.

Anomaly detection on tool-call patterns. If a tool that normally gets called 200 times a day is suddenly being called 2000 times an hour, something is wrong. The same anomaly detection used for product metrics works fine for tool-call telemetry.
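Even a crude rate check catches the example just described. The sketch below is a deliberately simple stand-in for real anomaly detection, assuming you already have an hourly call count and a rough daily baseline per tool; the function name, the tenfold factor and the minimum-call floor are all illustrative choices.

```python
def tool_call_rate_is_anomalous(
    calls_last_hour: int,
    baseline_calls_per_day: float,
    factor: float = 10.0,
    min_calls: int = 50,
) -> bool:
    """Flag a tool whose hourly call count is far above its usual rate.

    `min_calls` keeps rarely used tools from alerting on a handful of calls.
    """
    expected_per_hour = baseline_calls_per_day / 24
    threshold = max(expected_per_hour * factor, min_calls)
    return calls_last_hour > threshold


# The example from the text: normally 200 calls a day, now 2000 in one hour.
print(tool_call_rate_is_anomalous(2000, 200))  # True
print(tool_call_rate_is_anomalous(12, 200))    # False
```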

None of these are clever. They are operational hygiene that is unglamorous and gets skipped until the first runaway loop incident, after which the team builds them all in a panic. Build them before.
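The token budget and the tool-call depth limit can share one server-side guard. This is a sketch, not a framework integration: `ConversationGuard`, `BudgetExceeded` and `run_conversation` are hypothetical names, and `step` stands in for one iteration of your agent loop, returning the tokens it used, whether it called a tool, and whether it finished.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised when a conversation crosses a hard limit."""


@dataclass
class ConversationGuard:
    max_tokens: int
    max_tool_calls: int
    tokens_used: int = 0
    tool_calls: int = 0

    def record_model_call(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exceeded")

    def record_tool_call(self) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call limit of {self.max_tool_calls} exceeded")


def run_conversation(
    step: Callable[[], Tuple[int, bool, bool]], guard: ConversationGuard
) -> str:
    """Run agent steps until done, or until the guard stops the conversation."""
    try:
        while True:
            tokens, called_tool, done = step()
            guard.record_model_call(tokens)
            if called_tool:
                guard.record_tool_call()
            if done:
                return "completed"
    except BudgetExceeded as exc:
        # In a real system: log the incident, send a fallback reply, route to a human.
        return f"handed off to human: {exc}"


def stuck_step() -> Tuple[int, bool, bool]:
    # Simulates a loop: always calls a tool, never finishes.
    return 100, True, False


guard = ConversationGuard(max_tokens=50_000, max_tool_calls=10)
print(run_conversation(stuck_step, guard))
```

The limits are enforced by the loop, outside the model, so a confused agent cannot talk its way past them.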

Where this leaves you

If you are running an agent in production and the bill is making you uncomfortable, here is the order to look in:

  1. Are you attributing cost per feature, per tenant, per operation? If not, build that first. You cannot fix what you cannot see.
  2. Are you running a fallback to a premium model that is firing more often than you think? Check the logs for fallback rate. If it is above five percent and you have not noticed, it is the bill.
  3. Are your conversations growing without bound? Look at the p95 turn count and the p95 token count per conversation. Long-tail conversations are quietly expensive.
  4. Are you using the same model for everything? Tier the model selection. The savings are often forty to sixty percent on aggregate.
  5. Is caching live? On what fraction of requests? If you have not measured this, you probably have low-hanging fruit.
  6. Do you have per-conversation budgets and tool-depth limits? If not, the next runaway loop is the next budget conversation.
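Items 2 and 3 reduce to two small calculations once calls are labelled. The sketch below assumes you can export a list of operation labels and a list of per-conversation token totals from your logs; the function names and the sample data are made up for illustration, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import Sequence


def fallback_rate(operations: Sequence[str]) -> float:
    """Fallback calls as a fraction of primary calls."""
    primary = sum(1 for op in operations if op == "primary")
    fallback = sum(1 for op in operations if op == "fallback")
    return fallback / primary if primary else 0.0


def percentile_nearest_rank(values: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Made-up sample data standing in for a log export.
ops = ["primary"] * 100 + ["fallback"] * 8 + ["retry"] * 5
tokens_per_conversation = [2_000] * 90 + [40_000] * 10

print(f"fallback rate: {fallback_rate(ops):.0%}")
print(f"p95 tokens per conversation: {percentile_nearest_rank(tokens_per_conversation, 95)}")
```

In this sample the median conversation looks cheap while the p95 is twenty times larger, which is the long-tail pattern item 3 is asking you to look for.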

The model bill is rarely just the model bill. It is the sum of all the operational choices that nobody examined when the system was new and the volume was small. Examining them is most of the work. The clever architecture decisions are easy. The hard part is the cost attribution that lets you make them with eyes open.

By the time the system goes live, cost attribution is wired up, model tiering is decided, conversation budgets are set, and the runaway-loop guards are tested. The first month's bill arrives and there are no surprises. That is the goal.

The teams that build the cost discipline before they need it have predictable bills. The teams that wait until the bill is uncomfortable end up spending the same engineering time on it, just under more pressure and with a leadership team that is already nervous.