milebits
7 min read · latency · real-time · fraud · architecture

Sub-10ms decisioning: where the model isn't

In a real-time decisioning system, the language model is not the thing making the decision. It is part of the system around the decision. Put it in the hot path and you turn a risk engine into a latency incident.

Written by the milebits founders.

A team building real-time fraud scoring, ad bidding, or transaction authorization eventually gets the question from somewhere above them: can we use AI for this? The honest answer is yes, extensively. Just not where the question assumes. In a system that has to decide in single-digit milliseconds, the language model is not the thing making the decision. It is part of the system around the decision.

That distinction is the whole post. Get it right and the language model makes your fast system smarter every quarter. Get it wrong and you put a model that cannot meet your latency budget even on its best day into the one path that has no slack, and you find out in production.

To be precise, because it matters: this is not an argument against AI in the hot path. Small specialized models belong there. A gradient-boosted tree or a compact neural net scores in microseconds to low milliseconds and is exactly the right thing to run inline. The argument is about one specific kind of model, the large language model, whose latency lives one to two orders of magnitude above the budget. LLMs do not belong in sub-10ms decision paths. Small specialized models often do.

The fast path decides inside the budget. The language model contributes offline, asynchronously, in an advisory role, and through the improvement loop, never from inside the deadline. A live model call crosses the wall and arrives too late.

The sub-10ms constraint

At ten milliseconds you are not negotiating with taste. You are negotiating with physics. The budget is mostly spent on things that are not your model: a network hop, serialization, a feature lookup, the orchestration around the call. A single cross-region round trip can eat the whole thing before any computation happens.
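A back-of-the-envelope budget makes this concrete. The sketch below is illustrative only: every figure in it is an assumed round number chosen for the example, not a measurement, and the cost categories are simply the ones named above.

```python
# Illustrative latency budget for one synchronous decision.
# Every number here is an assumption chosen for the example, not a measurement.
BUDGET_MS = 10.0

fixed_costs_ms = {
    "network_hop": 2.0,      # assumed same-region round trip
    "serialization": 0.5,    # encode and decode request and response
    "feature_lookup": 1.5,   # online feature store read
    "orchestration": 1.0,    # routing, auth, logging hooks
}


def remaining_for_scoring(budget_ms: float, costs_ms: dict) -> float:
    """Return what is left for the scoring step after the fixed costs are paid."""
    return budget_ms - sum(costs_ms.values())


left = remaining_for_scoring(BUDGET_MS, fixed_costs_ms)
print(f"same-region: {left:.1f} ms left for scoring")

# Swap in a hypothetical cross-region round trip and the budget is gone
# before any computation happens.
cross_region = {**fixed_costs_ms, "network_hop": 60.0}
left_cross = remaining_for_scoring(BUDGET_MS, cross_region)
print(f"cross-region: {left_cross:.1f} ms left for scoring")
```

With these assumed figures, half the budget is spent before the scorer runs, and the cross-region case goes negative.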

Look at what the real systems publish. In real-time bidding, the auction carries a deadline that includes the round-trip network time, and intermediary exchanges are told to shrink it further to cover their own hop, so a bidder is instructed to target a response time well below the stated deadline. Stripe's Radar scores a transaction across more than a thousand signals in under 100 milliseconds. Those are the generous, end-to-end budgets. The synchronous scoring step inside them is what survives after the network and the plumbing take their cut, and that is where the single-digit-millisecond number comes from.

Now put a language model next to that number. On public leaderboards, the fastest production models emit their first token in a few hundred milliseconds, and frontier or reasoning models take longer, sometimes far longer once they are actually thinking. That is one to two orders of magnitude past a 10ms ceiling, and it is the figure before you add the network hop to reach the model. No amount of prompt engineering closes an order-of-magnitude gap. The model is not slow because someone configured it wrong. It is slow because generating language is a fundamentally heavier operation than looking up a feature and adding a score.
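To put a number on the gap, here is a small sketch. The time-to-first-token values are assumed round figures standing in for "a few hundred milliseconds" and "longer"; they are not benchmark results.

```python
import math

BUDGET_MS = 10.0

# Assumed, illustrative time-to-first-token values; not benchmark results.
ttft_ms = {"fast_model": 300.0, "reasoning_model": 1000.0}


def orders_of_magnitude_over(latency_ms: float, budget_ms: float) -> float:
    """How many powers of ten separate a latency from the budget."""
    return math.log10(latency_ms / budget_ms)


for name, latency in ttft_ms.items():
    gap = orders_of_magnitude_over(latency, BUDGET_MS)
    print(f"{name}: {latency / BUDGET_MS:.0f}x the budget, "
          f"{gap:.1f} orders of magnitude")
```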

What actually sits in the hot path

The machinery that makes a sub-10ms decision is unglamorous and has been in production for years.

Cached features from an online feature store, retrieved in single-digit milliseconds, and on the faster stores in under a millisecond.

A small specialized model doing the scoring: gradient-boosted trees like XGBoost, LightGBM, or CatBoost, a logistic regression, a compact neural net exported to ONNX. These are AI too. They are just not language models, and they score in microseconds to low single-digit milliseconds, which is exactly why they belong in the path. Stripe's inline Radar scorer is one of them, a tabular machine-learning model rather than a language model, and the same is true across payment risk and credit decisioning, where structured data and gradient boosting remain the backbone because they are accurate and they are fast.

Rules, for the things that should be deterministic.

Precomputed scores, for the things you can decide before the request arrives.

And a deterministic fallback for the moment an upstream dependency is slow or down, because a real-time system has to return something correct-enough on its worst day, not just its median one.

This is boring on purpose, and it is built for the tail, not the median. The hot path earns its place by being predictable under load. Nothing about that description is exciting, which is exactly why it works.
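The pieces listed in this section can be sketched in a few dozen lines. This is a minimal illustration, not a production design: the in-memory dictionaries stand in for an online feature store and a precomputed-score table, `small_model_score` is a hand-written stand-in for a compact tabular model, and every name, threshold, and weight is hypothetical.

```python
import time
from dataclasses import dataclass

# Hypothetical in-process cache standing in for an online feature store.
FEATURE_CACHE = {
    "card_123": {"txn_count_1h": 2, "avg_amount_30d": 40.0},
    "card_999": {"txn_count_1h": 14, "avg_amount_30d": 25.0},
}
# Hypothetical scores computed offline, before any request arrives.
PRECOMPUTED_SCORES = {"card_blocked": 1.0}


@dataclass
class Decision:
    action: str  # "approve" or "decline"
    source: str  # which component produced the decision


def small_model_score(features: dict, amount: float) -> float:
    """Stand-in for a compact tabular model; returns a risk score in [0, 1]."""
    velocity = min(features["txn_count_1h"] / 20.0, 1.0)
    ratio = min(amount / (10.0 * features["avg_amount_30d"]), 1.0)
    return 0.6 * velocity + 0.4 * ratio


def fallback(amount: float) -> Decision:
    """Deterministic fallback: approve only small amounts when inputs are missing."""
    return Decision("approve" if amount <= 100.0 else "decline", "fallback")


def decide(card_id: str, amount: float, budget_ms: float = 10.0) -> Decision:
    deadline = time.perf_counter() + budget_ms / 1000.0
    # 1. Precomputed score: decided before the request arrived.
    if PRECOMPUTED_SCORES.get(card_id, 0.0) >= 0.9:
        return Decision("decline", "precomputed")
    # 2. Deterministic rule.
    if amount > 5000.0:
        return Decision("decline", "rule")
    # 3. Cached features; a miss or a blown deadline goes to the fallback.
    features = FEATURE_CACHE.get(card_id)
    if features is None or time.perf_counter() > deadline:
        return fallback(amount)
    # 4. Small model on cached features.
    score = small_model_score(features, amount)
    return Decision("decline" if score >= 0.5 else "approve", "model")


print(decide("card_123", 50.0))
print(decide("card_999", 200.0))
print(decide("unknown_card", 50.0))
```

Every branch returns without waiting on anything slower than a dictionary lookup, which is the property the hot path is built around.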

Where the language model belongs

It belongs almost everywhere except the hot path, and the value it adds in those places is real.

Before the decision, offline. A model can generate and test candidate rules that a fast engine then enforces inline. Stripe shows the broader pattern, using AI to shape risk-based Radar rules that the synchronous scorer applies at speed. The model labels and enriches training data. It helps engineer the features the tabular model consumes. Its intelligence reaches the hot path frozen into a rule or a feature, not running live.
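One way to picture "frozen into a rule" is rules as plain data. In the sketch below, the rule file format, field names, and thresholds are all hypothetical; the assumption is that candidates were proposed offline (for example by a language model), reviewed and backtested, and only then shipped to the fast engine.

```python
import json
import operator

# Hypothetical rule file: candidates proposed offline, reviewed, then frozen as data.
RULES_JSON = """
[
  {"id": "r1", "field": "amount", "op": ">", "value": 5000, "action": "decline"},
  {"id": "r2", "field": "txn_count_1h", "op": ">=", "value": 10, "action": "review"}
]
"""

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def compile_rules(raw: str) -> list:
    """Parse once at deploy time; the hot path only sees plain callables."""
    compiled = []
    for rule in json.loads(raw):
        compare = OPS[rule["op"]]  # an unknown operator fails here, at load time
        compiled.append(
            (rule["id"], rule["field"], compare, rule["value"], rule["action"])
        )
    return compiled


def apply_rules(compiled: list, txn: dict):
    """Return (rule_id, action) for the first matching rule, else None."""
    for rule_id, field, compare, value, action in compiled:
        if field in txn and compare(txn[field], value):
            return rule_id, action
    return None


RULES = compile_rules(RULES_JSON)
print(apply_rules(RULES, {"amount": 7200, "txn_count_1h": 1}))   # ('r1', 'decline')
print(apply_rules(RULES, {"amount": 20, "txn_count_1h": 12}))    # ('r2', 'review')
print(apply_rules(RULES, {"amount": 20, "txn_count_1h": 1}))     # None
```

Nothing in `apply_rules` knows or cares where a rule came from, so no model runs at request time.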

After the decision, async. The user already got their answer in single-digit milliseconds. Now the model summarizes the flagged case, drafts the dispute explanation, enriches the record. The work happens off the critical path, where a few seconds cost nothing.
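A minimal sketch of that handoff, using only the standard library: the decision returns immediately, and a background worker picks up flagged cases. `summarize_case` is a hypothetical placeholder for a slow language-model call, and the in-process queue stands in for whatever message bus a real system would use.

```python
import queue
import threading

enrichment_queue = queue.Queue()
case_notes = {}


def summarize_case(case: dict) -> str:
    """Hypothetical placeholder for a slow language-model call."""
    return f"Transaction {case['txn_id']} flagged with score {case['score']:.2f}."


def worker() -> None:
    """Drain the queue off the critical path; seconds of latency are fine here."""
    while True:
        case = enrichment_queue.get()
        case_notes[case["txn_id"]] = summarize_case(case)
        enrichment_queue.task_done()


def decide_and_enqueue(txn_id: str, score: float) -> str:
    """Return the decision at once; hand flagged cases to the async worker."""
    action = "decline" if score >= 0.5 else "approve"
    if action == "decline":
        enrichment_queue.put_nowait({"txn_id": txn_id, "score": score})
    return action


threading.Thread(target=worker, daemon=True).start()

print(decide_and_enqueue("t-1", 0.91))  # the caller already has its answer
enrichment_queue.join()                 # only the demo waits; the caller never does
print(case_notes["t-1"])
```

The only thing the request path pays for is a non-blocking enqueue.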

Beside the human, advisory. Analyst copilots that rank and summarize cases, investigation tools that turn a queue of alerts into something a person can work through. The relevant budget here is a human's attention, measured in minutes, not a transaction's, measured in milliseconds.

Inside the improvement loop. The model reads the cases the fast scorer got wrong, finds the pattern, and feeds the next version. The fast system stays fast; it just gets better between releases.
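The first step of that loop is mechanical and worth sketching: join decisions with eventual outcomes, keep the misses, and group them so whoever reviews them (a person or a model) sees clusters rather than a flat list. The log format and field names below are hypothetical.

```python
from collections import Counter

# Hypothetical log: what the fast scorer decided, and what later turned out to be true.
decision_log = [
    {"txn_id": "t1", "action": "approve", "merchant_type": "gift_cards", "was_fraud": True},
    {"txn_id": "t2", "action": "approve", "merchant_type": "grocery", "was_fraud": False},
    {"txn_id": "t3", "action": "decline", "merchant_type": "travel", "was_fraud": False},
    {"txn_id": "t4", "action": "approve", "merchant_type": "gift_cards", "was_fraud": True},
    {"txn_id": "t5", "action": "decline", "merchant_type": "gift_cards", "was_fraud": True},
]


def misses(log: list) -> list:
    """Cases where the decision disagreed with the eventual outcome."""
    return [d for d in log if (d["action"] == "decline") != d["was_fraud"]]


def review_batch(log: list, key: str) -> list:
    """Count misses by one attribute, most frequent first."""
    return Counter(d[key] for d in misses(log)).most_common()


print([d["txn_id"] for d in misses(decision_log)])   # ['t1', 't3', 't4']
print(review_batch(decision_log, "merchant_type"))   # [('gift_cards', 2), ('travel', 1)]
```

The batch that comes out of this is the input to the offline work; the scorer in the hot path is untouched until the next release.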

The language model is how the decision got smart. It is not the thing that makes the decision.

The failure mode

The failure is predictable because the incentive is strong. The model is the exciting part, the part everyone wants to point at, so it ends up in the synchronous path where it is most visible. Then one of two things happens. Either it cannot meet the budget, the timeout fires, and the system falls back to the fast path anyway, so the model added latency risk and a new failure mode while contributing nothing to the decision. Or the team quietly raises the latency budget to accommodate it, and a real-time system stops being real-time. A risk engine becomes a latency incident. The thing that was supposed to make the system smarter made it slower and more fragile, and the postmortem is about an architecture decision, not a model.
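Both outcomes fall out of simple arithmetic. The sketch below models a synchronous language-model call with a timeout and a fast-path fallback; the latencies are assumed figures for illustration, and the function simulates the timing rather than calling any real model.

```python
def sync_llm_path(llm_latency_ms: float, timeout_ms: float, fast_path_ms: float):
    """Model a request that calls an LLM inline, with a timeout and a fallback.

    Returns (total_latency_ms, decided_by).
    """
    if llm_latency_ms <= timeout_ms:
        return llm_latency_ms, "llm"
    # The timeout fired: the wait is already spent, then the fast path runs anyway.
    return timeout_ms + fast_path_ms, "fast_path"


FAST_PATH_MS = 2.0    # assumed cost of cached features plus a small model
LLM_LATENCY_MS = 400.0  # assumed inline model latency

# Outcome 1: keep the budget. The model never answers in time.
print(sync_llm_path(LLM_LATENCY_MS, timeout_ms=8.0, fast_path_ms=FAST_PATH_MS))
# (10.0, 'fast_path'): five times the fast path's latency, same decision.

# Outcome 2: raise the timeout to fit the model. The system is no longer real-time.
print(sync_llm_path(LLM_LATENCY_MS, timeout_ms=500.0, fast_path_ms=FAST_PATH_MS))
# (400.0, 'llm')
```

In the first case the fast path made the decision anyway, only later. In the second, the budget itself was the casualty.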

The operating rule

Use the language model to improve the decision system, not to make every decision in real time. The decision itself is made by something fast and boring, on cached features and a small model and a few rules. The language model makes that fast, boring thing better, continuously, from outside the budget where it has the room to do the heavy work it is actually good at.

None of this is an argument against using AI in real-time systems. It is an argument about where the AI goes. The teams that ship reliable real-time systems are the ones that put the language model everywhere it can do its heavy, valuable work without a clock on it, and keep it out of the one place where the clock is measured in milliseconds. Knowing where not to put the model is most of the design.
