How is this different from hiring full-time engineers?

Full-time engineers are usually the right answer eventually. We're the bridge. You hire us when you need someone shipping by Monday and a job posting won't close for months. When you've hired the right team, we hand off everything. The code is already yours, the infra is already in your cloud, the runbook is already written. Working ourselves out of the job is the goal, and we don't mind when it happens.

What if our codebase is a mess?

Most are. The founders each have a decade in production engineering, across most stack vintages still in use: Rails monoliths, four-year-old Next.js apps with three router migrations, greenfield TypeScript. We adapt to your conventions, your CI, your branching model. We won't try to rewrite your stack into our preferred one. That's the tell of an agency that's really selling templates.

Do you really have no contracts longer than a month?

Real answer: our MSA is signed once, and engagements run on monthly purchase orders. You can pause or end at any time on 30 days' notice. That's not a fine-print clause; it's the operating model. We'd rather stay because we're earning it than because you signed up for a year.

What does 'senior engineer' actually mean here?

Founder-built. Senior operators with a decade in production engineering across AI, infrastructure, and platform work. There is no junior bench to hide behind. You get the people who'd actually be writing the architecture doc anywhere else. We don't post bios on the site because LinkedIn isn't the right hiring surface; we'd rather you meet whoever is staffed on the first call and decide from there.

How fast can you actually start?

Sprint engagements: typically 7 days from signature. Pod engagements: 14 days. We won't lie about availability. Founder-led means slots are real and limited. We'd rather tell you 'we can't take you for six weeks' than start late and miss the timeline.

We've been burned by agencies before. Why is this different?

Mostly because we don't try to be everything. Six services. Senior engineers only. Code in your repo. Monthly contracts. Weekly demos. Founder-led. No PMs in the loop, no offshore handoff, no proprietary platform. If you've been burned, you know what bit you. We've tried to make ourselves the opposite of that.

What does week one actually look like?

Day 1: kickoff call, Slack channel created, repo access exchanged, problem statement written and pinned. Day 2: scoping doc with the smallest shippable thing identified. Days 3–4: working prototype in a sandbox. Day 5: Loom walkthrough, demo on your calendar for next Friday. Week one is choreographed. The improvising starts in week two.

Do you sign NDAs? BAAs? SOC 2 vendor questionnaires?

Yes to all three. Vendor security questionnaires are turned around quickly because we are deliberately small and there's no committee to route around. BAAs are ready for legal review on day one for healthcare engagements. The goal is to be the easiest vendor your procurement team deals with this quarter.

June 2, 2026 · 7 min read · latency · real-time · fraud · architecture

Sub-10ms decisioning: where the model isn't

In a real-time decisioning system, the language model is not the thing making the decision. It is the system around the decision. Put it in the hot path and you turn a risk engine into a latency incident.

Written by the milebits founders.

A team building real-time fraud scoring, ad bidding, or transaction authorization eventually gets the question from somewhere above them: can we use AI for this? The honest answer is yes, extensively. Just not where the question assumes. In a system that has to decide in single-digit milliseconds, the language model is not the thing making the decision. It is the system around the decision.

That distinction is the whole post. Get it right and the language model makes your fast system smarter every quarter. Get it wrong and you put a model that cannot meet your latency budget on its best day into the one path that has no slack, and you find out in production.

To be precise, because it matters: this is not an argument against AI in the hot path. Small specialized models belong there. A gradient-boosted tree or a compact neural net scores in microseconds to low milliseconds and is exactly the right thing to run inline. The argument is about one specific kind of model, the large language model, whose latency lives one to two orders of magnitude above the budget. LLMs do not belong in sub-10ms decision paths. Small specialized models often do.

[Figure: a fast hot path of feature lookups, rules, and a small model reaches a decision inside a 10ms budget, then hits a deadline wall. A language model call crosses the wall and arrives too late. Below the path, four lanes labeled offline, async, advisory, and loop feed the hot path from outside the clock.]

The fast path decides inside the budget. The model contributes from offline, async, advisory, and the improvement loop, never from inside the deadline. A live model call crosses the wall and arrives too late.

The sub-10ms constraint

At ten milliseconds you are not negotiating with taste. You are negotiating with physics. The budget is mostly spent on things that are not your model: a network hop, serialization, a feature lookup, the orchestration around the call. A single cross-region round trip can eat the whole thing before any computation happens.

Look at what the real systems publish. In real-time bidding, the auction carries a deadline that includes the round-trip network time, and intermediary exchanges are told to shrink it further to cover their own hop, so a bidder is instructed to target a response time well below the stated deadline. Stripe's Radar scores a transaction across more than a thousand signals in under 100 milliseconds. Those are the generous, end-to-end budgets. The synchronous scoring step inside them is what survives after the network and the plumbing take their cut, and that is where the single-digit-millisecond number comes from.

Now put a language model next to that number. The fastest production models emit their first token in a few hundred milliseconds on public leaderboards, and frontier or reasoning models take longer, sometimes far longer once they are actually thinking. That is one to two orders of magnitude past a 10ms ceiling, and it is the figure before you add the network hop to reach the model. There is no prompt engineering that closes an order-of-magnitude gap. The model is not slow because someone configured it wrong. It is slow because generating language is a fundamentally heavier operation than looking up a feature and adding a score.
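The arithmetic is worth doing once by hand. The sketch below is illustrative only: the per-stage costs and the two time-to-first-token values are assumed numbers chosen to match the ranges described above, not measurements from any real system.

```python
import math

BUDGET_MS = 10.0

# Hypothetical per-stage overheads in milliseconds (assumed, not measured).
OVERHEAD_MS = {
    "network_hop": 2.0,
    "serialization": 0.5,
    "feature_lookup": 2.0,
    "orchestration": 0.5,
}


def remaining_for_scoring(budget_ms, overhead_ms):
    """Milliseconds left for the scoring step after the plumbing takes its cut."""
    return budget_ms - sum(overhead_ms.values())


def orders_of_magnitude_over(latency_ms, budget_ms):
    """How many powers of ten a latency sits above a budget."""
    return math.log10(latency_ms / budget_ms)


scoring_ms = remaining_for_scoring(BUDGET_MS, OVERHEAD_MS)

# Assumed time-to-first-token values: a fast model and a slower one.
fast_llm_gap = orders_of_magnitude_over(300.0, BUDGET_MS)
slow_llm_gap = orders_of_magnitude_over(1000.0, BUDGET_MS)

print(f"left for scoring: {scoring_ms:.1f} ms")
print(f"gap at 300 ms: {fast_llm_gap:.2f} orders; at 1000 ms: {slow_llm_gap:.2f} orders")
```

Under these assumptions half the budget is gone before any scoring happens, and the model's first token alone lands between roughly one and a half and two orders of magnitude past the full 10ms.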

What actually sits in the hot path

The machinery that makes a sub-10ms decision is unglamorous and has been in production for years.

Cached features from an online feature store, retrieved in single-digit milliseconds, and on the faster stores in under a millisecond.

A small specialized model doing the scoring: gradient-boosted trees like XGBoost, LightGBM, or CatBoost, a logistic regression, a compact neural net compiled to ONNX. These are AI too. They are just not language models, and they score in microseconds to low single-digit milliseconds, which is exactly why they belong in the path. Stripe's inline Radar scorer is one of them, a tabular machine-learning model rather than a language model, and the same is true across payment risk and credit decisioning, where structured data and gradient boosting remain the backbone because they are accurate and they are fast.

Rules, for the things that should be deterministic. Precomputed scores, for the things you can decide before the request arrives.

And a deterministic fallback for the moment an upstream dependency is slow or down, because a real-time system has to return something correct-enough on its worst day, not just its median one.
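A minimal sketch of that shape, to make the ordering concrete. Everything named here is hypothetical: the in-memory dictionaries stand in for an online feature store and a precomputed-score table, the weights in `small_model_score` are made up, and the "review" fallback is one possible conservative default rather than a recommendation.

```python
import math
import time
from dataclasses import dataclass

BUDGET_S = 0.010  # the 10ms deadline discussed in this post

# Hypothetical stand-ins for an online feature store and precomputed scores.
FEATURE_CACHE = {"card_123": {"amount_zscore": 0.4, "txn_last_hour": 2.0}}
PRECOMPUTED_SCORES = {"card_blocked": 1.0}


@dataclass(frozen=True)
class Decision:
    action: str
    score: float
    source: str


# Deterministic answer for a missing feature row or a blown budget.
FALLBACK = Decision("review", 0.5, "fallback")


def small_model_score(features):
    """Stand-in tabular model: a logistic regression with made-up weights."""
    z = -2.0 + 1.5 * features["amount_zscore"] + 0.3 * features["txn_last_hour"]
    return 1.0 / (1.0 + math.exp(-z))


def decide(card_id, amount, clock=time.monotonic):
    start = clock()
    # Rules: things that should be deterministic.
    if amount <= 0:
        return Decision("decline", 1.0, "rule")
    # Precomputed scores: decided before the request arrived.
    if card_id in PRECOMPUTED_SCORES:
        return Decision("decline", PRECOMPUTED_SCORES[card_id], "precomputed")
    # Cached features, then a deadline check before scoring.
    features = FEATURE_CACHE.get(card_id)
    over_budget = clock() - start > BUDGET_S
    if features is None or over_budget:
        return FALLBACK
    score = small_model_score(features)
    return Decision("decline" if score >= 0.8 else "approve", score, "model")
```

The `clock` parameter exists only so the deadline branch can be exercised with a fake clock in a test. In a real service the feature lookup would be a network call with its own timeout, and the fallback would be whatever the business has agreed is safe to return on a bad day.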

This is boring on purpose, and it is built for the tail, not the median. The hot path earns its place by being predictable under load. Nothing about that description is exciting, which is exactly why it works.

Where the language model belongs

It belongs almost everywhere except the hot path, and the value it adds there is real.

Before the decision, offline. A model can generate and test candidate rules that a fast engine then enforces inline. Stripe shows the broader pattern, using AI to shape risk-based Radar rules that the synchronous scorer applies at speed. The model labels and enriches training data. It helps engineer the features the tabular model consumes. Its intelligence reaches the hot path frozen into a rule or a feature, not running live.
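One way to picture "frozen into a rule" is a small offline gate: candidate rules are checked against labeled history, and only the ones that pass are loaded into the inline engine. This is a sketch under assumptions, not a description of how Stripe or anyone else does it. The candidate rules are written by hand here as if a language model had proposed them, and the four-row history and the 0.9 precision bar are invented for illustration.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    predicate: Callable[[dict], bool]


# Hypothetical candidates, as if proposed offline by a language model.
CANDIDATES = [
    Rule("high_amount_new_account",
         lambda t: t["amount"] > 500 and t["account_age_days"] < 7),
    Rule("any_foreign", lambda t: t["foreign"]),
]

# Tiny made-up labeled history: (transaction, was_fraud).
HISTORY = [
    ({"amount": 900, "account_age_days": 2, "foreign": True}, True),
    ({"amount": 700, "account_age_days": 3, "foreign": False}, True),
    ({"amount": 40, "account_age_days": 400, "foreign": True}, False),
    ({"amount": 25, "account_age_days": 90, "foreign": False}, False),
]


def precision(rule, history):
    """Share of transactions matched by the rule that were actually fraud."""
    hits = [was_fraud for txn, was_fraud in history if rule.predicate(txn)]
    return sum(hits) / len(hits) if hits else 0.0


def freeze_rules(candidates, history, min_precision=0.9):
    """Offline step: keep only candidates that clear the precision bar."""
    return [r for r in candidates if precision(r, history) >= min_precision]


# Computed offline and shipped; the hot path never calls a language model.
ACTIVE_RULES = freeze_rules(CANDIDATES, HISTORY)


def hot_path_flag(txn):
    """Inline check: plain predicate evaluation over the frozen rule set."""
    return any(r.predicate(txn) for r in ACTIVE_RULES)
```

On this toy history the broad `any_foreign` rule is rejected because half of its matches are legitimate, and only the narrower rule reaches the inline engine.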

After the decision, async. The user already got their answer in single-digit milliseconds. Now the model summarizes the flagged case, drafts the dispute explanation, enriches the record. The work happens off the critical path, where a few seconds cost nothing.
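In code, the async lane is little more than "answer first, enqueue second." The sketch below uses an in-process `queue.Queue` purely for illustration; a production system would more likely use a durable queue or stream. `summarize_case` is a placeholder for a language model call, and the 0.8 threshold is an assumed value.

```python
import queue

# In-process stand-in for a durable job queue (assumption for illustration).
enrichment_queue = queue.Queue()


def summarize_case(case):
    """Placeholder for a language model call; in practice this may take seconds."""
    return f"Transaction {case['txn_id']} flagged with score {case['score']:.2f}"


def decide_and_enqueue(txn_id, score, threshold=0.8):
    """Hot path: return the decision now and leave the slow work for later."""
    flagged = score >= threshold
    if flagged:
        # put() on an unbounded Queue does not block the caller.
        enrichment_queue.put({"txn_id": txn_id, "score": score})
    return "decline" if flagged else "approve"


def drain(q):
    """Run by a background worker, off the critical path."""
    summaries = []
    while True:
        try:
            case = q.get_nowait()
        except queue.Empty:
            return summaries
        summaries.append(summarize_case(case))
```

The caller of `decide_and_enqueue` never waits on `summarize_case`. However long the summary takes, it is paid for by the worker that calls `drain`, not by the transaction.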

Beside the human, advisory. Analyst copilots that rank and summarize cases, investigation tools that turn a queue of alerts into something a person can work through. The relevant budget here is a human's attention, measured in minutes, not a transaction's, measured in milliseconds.

Inside the improvement loop. The model reads the cases the fast scorer got wrong, finds the pattern, and feeds the next version. The fast system stays fast; it just gets better between releases.

The language model is how the decision got smart. It is not the thing that makes the decision.

The failure mode

The failure is predictable because the incentive is strong. The model is the exciting part, the part everyone wants to point at, so it ends up in the synchronous path where it is most visible. Then one of two things happens. Either it cannot meet the budget, the timeout fires, and the system falls back to the fast path anyway, so the model added latency risk and a new failure mode while contributing nothing to the decision. Or the team quietly raises the latency budget to accommodate it, and a real-time system stops being real-time. A risk engine becomes a latency incident. The thing that was supposed to make the system smarter made it slower and more fragile, and the postmortem is about an architecture decision, not a model.
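The first branch of that failure is easy to reproduce. The sketch below puts a stubbed model call in the synchronous path behind a 10ms timeout. The 200ms sleep is an assumed stand-in for model latency, and both scoring functions return fixed made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

BUDGET_S = 0.010  # 10ms deadline

_pool = ThreadPoolExecutor(max_workers=4)


def llm_score_stub(txn):
    """Stand-in for a live model call; the sleep is an assumed latency."""
    time.sleep(0.2)
    return 0.9


def fast_score(txn):
    """Stand-in for the small inline model."""
    return 0.2


def decide_with_llm_inline(txn):
    """Anti-pattern: try the language model first, fall back on timeout."""
    start = time.monotonic()
    future = _pool.submit(llm_score_stub, txn)
    try:
        score, source = future.result(timeout=BUDGET_S), "llm"
    except FutureTimeout:
        # The timeout fires and the fast path answers anyway. The worker
        # thread stays busy until the stubbed call finishes.
        score, source = fast_score(txn), "fast_path"
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return score, source, elapsed_ms
```

Every request spends its whole budget waiting, the answer still comes from the fast scorer, and each timed-out call keeps holding a worker until it finishes. Under load that last part is where the new failure mode shows up.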

The operating rule

Use the language model to improve the decision system, not to make every decision in real time. The decision itself is made by something fast and boring, on cached features and a small model and a few rules. The language model makes that fast, boring thing better, continuously, from outside the budget where it has the room to do the heavy work it is actually good at.

None of this is an argument against using AI in real-time systems. It is an argument about where the AI goes. The teams that ship reliable real-time systems are the ones that put the language model everywhere it can do its heavy, valuable work without a clock on it, and kept it out of the one place where the clock is measured in milliseconds. Knowing where not to put the model is most of the design.
