How is this different from hiring full-time engineers?

Full-time engineers are usually the right answer eventually. We're the bridge. You hire us when you need someone shipping by Monday and a job posting won't close for months. When you've hired the right team, we hand off everything. The code is already yours, the infra is already in your cloud, the runbook is already written. Working ourselves out of the job is the goal, and we don't mind when it happens.

What if our codebase is a mess?

Most are. The founders each have a decade in production engineering and have worked across most stack vintages still in use: Rails monoliths, 4-year-old Next.js apps with three router migrations, greenfield TypeScript. We adapt to your conventions, your CI, your branching model. We won't try to rewrite your stack into our preferred one. That's the tell of an agency that's actually selling templates.

Do you really have no contracts longer than a month?

Real answer: our MSA is signed once, and engagements run on monthly purchase orders. You can pause or end at any time on 30 days' notice. That's not a fine-print clause, it's the operating model. We'd rather stay because we're earning it than because you signed up for a year.

What does 'senior engineer' actually mean here?

Founder-built. Senior operators with a decade in production engineering across AI, infrastructure, and platform work. There is no junior bench to hide behind. You get the people who'd actually be writing the architecture doc anywhere else. We don't post bios on the site because LinkedIn isn't the right hiring surface; we'd rather you meet whoever is staffed on the first call and decide from there.

How fast can you actually start?

Sprint engagements: typically 7 days from signature. Pod engagements: 14 days. We won't lie about availability. Founder-led means slots are real and limited. We'd rather tell you 'we can't take you for six weeks' than start late and miss the timeline.

We've been burned by agencies before. Why is this different?

Mostly because we don't try to be everything. Six services. Senior engineers only. Code in your repo. Monthly contracts. Weekly demos. Founder-led. No PMs in the loop, no offshore handoff, no proprietary platform. If you've been burned, you know what bit you. We've tried to make ourselves the opposite of that.

What does week one actually look like?

Day 1: kickoff call, Slack channel created, repo access exchanged, problem statement written and pinned. Day 2: scoping doc with the smallest shippable thing identified. Day 3–4: working prototype in a sandbox. Day 5: Loom walkthrough, demo on your calendar for next Friday. Week one is choreographed. The improvising starts in week two.

Do you sign NDAs? BAAs? SOC 2 vendor questionnaires?

Yes to all three. Vendor security questionnaires are turned around quickly because we are deliberately small and there's no committee to route around. BAAs are ready for legal review on day one for healthcare engagements. The goal is to be the easiest vendor your procurement team deals with this quarter.

May 23, 2026 · 8 min read · agents · multi-agent · architecture · production

Most multi-agent systems should be one agent

The multi-agent demo looks incredible. The multi-agent production system is where teams go to debug nondeterminism for a quarter. A multi-agent system is a distributed system where the nodes can hallucinate.

Written by the milebits founders.

The multi-agent architecture diagram is one of the most seductive objects in AI engineering right now. A researcher agent hands to a planner agent hands to a writer agent hands to a critic agent, each box with its own prompt and personality, the whole thing humming along like an org chart that staffed itself. It demos beautifully. It is also, for most of the problems teams point it at, the wrong architecture, and the gap between how good it looks and how it behaves in production is wide.

A multi-agent system is a distributed system where the nodes can hallucinate. Every hard thing about distributed systems still applies, and you have added components that are nondeterministic by design. That is the trade you are making when you split one agent into five. Sometimes it is worth it. Usually it is not, and the single agent with good tools that you talked yourself out of would have shipped sooner and broken less.

Why the demo seduces

Multi-agent decomposition maps onto how we think about human teams, which is why it feels right. You would not ask one person to research, plan, write, and critique a document in a single pass, so it seems natural to give each of those jobs to its own agent. The decomposition reads as clean engineering.

It demos well because in the demo the paths are happy. Each agent gets a reasonable input, produces a reasonable output, passes it along. The narrator can point at each box and explain its job. The architecture explains itself, which makes it persuasive to everyone in the room including the people who will have to operate it.

The seduction is real and the instinct behind it is not stupid. It is just that the thing that makes it readable in a diagram is not the thing that makes it work in production, and those two come apart fast once real inputs arrive.

What you actually pay for

Splitting one agent into several has a bill, and it has line items teams do not price in advance.

Nondeterminism multiplies. One agent has a distribution of possible behaviours. Five agents in a chain have a distribution of distributions, and the failure modes compound across handoffs. When the output is wrong, the question "which agent caused this" does not have a clean answer, because the planner's slightly-off plan produced a reasonable-looking input to the writer that produced a subtly wrong draft that the critic approved because it looked fine in isolation. Nobody is clearly at fault, which means nothing is clearly fixable.
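A rough way to see the compounding is to treat each handoff as an independent step with its own chance of producing a usable output. That independence, and the 95% per-step figure below, are illustrative assumptions rather than measurements; real chains can be better or worse because errors correlate.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every step in a sequential chain succeeds.

    Assumes steps fail independently, which is a simplification.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step_success ** steps


# 0.95 is a made-up per-step figure, used only to show the shape of the curve.
for steps in (1, 3, 5):
    print(steps, round(chain_success_rate(0.95, steps), 3))
# 1 0.95
# 3 0.857
# 5 0.774
```

Under those assumptions, a step that is right 95% of the time gives a five-step chain that is right end to end about 77% of the time, with no single step obviously to blame.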

Latency stacks. Each agent in a sequential chain adds its own model call, often several. A four-agent pipeline can be four to ten model round-trips where a single agent would have been one or two. In anything a user waits on, this is the difference between responsive and sluggish, and it is the same latency budget problem made worse by architecture.

Cost multiplies the same way. More agents, more calls, more tokens, more of the hidden cost stack. Each handoff re-establishes context, which means re-sending information the previous agent already had. Anthropic, who run one of the more sophisticated multi-agent systems in production, put a number on this: a single agent uses roughly four times the tokens of a chat interaction, and a multi-agent system uses about fifteen times the tokens of that same chat interaction. That is the cost of the architecture before it does anything useful.
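To turn those multipliers into a budget line, a back-of-envelope sketch like the following is usually enough. The 4x and 15x figures are the ones quoted above; the baseline of 2,000 tokens per chat and the price of 10.00 per million tokens are hypothetical placeholders to swap for your own numbers.

```python
# Multipliers relative to a plain chat interaction, as quoted in the text.
SINGLE_AGENT_MULTIPLIER = 4
MULTI_AGENT_MULTIPLIER = 15


def estimated_tokens(chat_baseline_tokens: int, multiplier: int) -> int:
    """Scale a chat-sized token count by an architecture multiplier."""
    return chat_baseline_tokens * multiplier


def estimated_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a token count at a flat price per million tokens."""
    return tokens / 1_000_000 * price_per_million_tokens


# Hypothetical inputs: replace with your own traffic and pricing.
baseline_tokens = 2_000
price = 10.0

single = estimated_tokens(baseline_tokens, SINGLE_AGENT_MULTIPLIER)
multi = estimated_tokens(baseline_tokens, MULTI_AGENT_MULTIPLIER)

print(f"single agent: {single} tokens, {estimated_cost(single, price):.2f} per task")
print(f"multi-agent:  {multi} tokens, {estimated_cost(multi, price):.2f} per task")
print(f"ratio: {multi / single}x")
```

Whatever baseline and price you plug in, the ratio between the two architectures stays at 3.75x, which is the number to weigh against whatever the extra agents are supposed to buy.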

Debugging gets genuinely hard. With one agent you read one trace. With five you are reconstructing a conversation between processes that each made a plausible local decision that summed to a bad global outcome. We have watched teams spend weeks adding logging to multi-agent systems just to reach the level of insight a single-agent system gives you for free. The failure modes are now catalogued: a study, Why Do Multi-Agent LLM Systems Fail?, identified fourteen distinct ways these systems break across three categories, from agents misreading their assignment to skipped verification to agents quietly making conflicting assumptions nobody reconciled. None of those failure modes exist in a system with one agent and explicit control flow.

When one agent with good tools wins

For most of what teams build, the honest architecture is a single agent with a well-designed set of tools and a clear control flow. The "agents" in the multi-agent diagram are usually better expressed as tools the one agent can call, or as steps in an explicit state machine that you, not the model, control.

The researcher agent is a search tool. The writer is the main generation step. The critic is an eval pass or a validation function. Modelled this way, the same work happens, but you keep a single coherent trace, one place where the control flow lives, and a system you can step through. The boring-stack instinct applies here directly: the architecture you can debug at 2am beats the one that reads well on a slide.

The rule we use: the model should make the decisions that genuinely need its judgment, and explicit code should make the decisions that do not. Multi-agent designs tend to hand control flow itself to the models, which means the models are now improvising the orchestration on top of doing the work. Pulling the orchestration back into code you control removes most of the nondeterminism without removing any of the capability. Cognition, the team behind Devin, reached the same place in Don't Build Multi-Agents: their advice was that the simplest reliable shape is a single-threaded linear agent, because once decisions are dispersed across agents that cannot share full context, conflicting assumptions produce fragile results.
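The mapping above (researcher as a search tool, writer as the generation step, critic as a validation function) can be sketched as a small state machine. This is a minimal illustration, not a production design: every name here (`search_tool`, `validate`, `run`, `fake_model`) is hypothetical, and the model call is replaced by a deterministic stand-in so the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable


class Step(Enum):
    RESEARCH = auto()
    DRAFT = auto()
    VALIDATE = auto()
    DONE = auto()
    FAILED = auto()


@dataclass
class RunState:
    question: str
    notes: str = ""
    draft: str = ""
    attempts: int = 0
    trace: list[str] = field(default_factory=list)  # the single trace


def search_tool(query: str) -> str:
    """Stand-in for a real search tool (the 'researcher' box)."""
    return f"notes about {query}"


def validate(draft: str) -> bool:
    """Plain-code check (the 'critic' box): the draft must use the notes."""
    return "notes about" in draft


def run(question: str, generate: Callable[[str], str], max_attempts: int = 2) -> RunState:
    """Code owns the transitions; the model is only asked to write the draft."""
    state = RunState(question)
    step = Step.RESEARCH
    while step not in (Step.DONE, Step.FAILED):
        state.trace.append(step.name)
        if step is Step.RESEARCH:
            state.notes = search_tool(question)
            step = Step.DRAFT
        elif step is Step.DRAFT:
            state.attempts += 1
            state.draft = generate(f"Question: {question}\nNotes: {state.notes}")
            step = Step.VALIDATE
        elif step is Step.VALIDATE:
            if validate(state.draft):
                step = Step.DONE
            elif state.attempts < max_attempts:
                step = Step.DRAFT  # bounded retry, decided by code
            else:
                step = Step.FAILED
    state.trace.append(step.name)
    return state


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for a model call; it echoes the notes line."""
    return "Answer based on " + prompt.split("Notes: ", 1)[1]


result = run("rate limits", fake_model)
print(result.trace)  # ['RESEARCH', 'DRAFT', 'VALIDATE', 'DONE']
```

The retry limit and the order of steps live in ordinary code, so a bad run shows up as one readable list of transitions rather than a conversation between processes.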

When multi-agent genuinely earns it

There are real cases. They are narrower than the hype, and they share a shape.

Genuinely parallel, independent subtasks. If you need to analyse forty documents and the analyses do not depend on each other, running forty agent instances in parallel is sound. This is not really multi-agent in the orchestrated sense. It is one agent, run many times, with no inter-agent reasoning. That pattern is fine and we reach for it freely.
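A sketch of that fan-out, assuming the per-document work is I/O-bound (as a model call would be) so a thread pool is a reasonable fit. `analyse_document` is a hypothetical stand-in that only counts characters and words; a real version would call a model, and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse_document(doc: str) -> dict:
    """Stand-in for one independent agent run over one document."""
    return {"chars": len(doc), "words": len(doc.split())}


def analyse_all(docs: list[str], max_workers: int = 8) -> list[dict]:
    """Run the same analysis over every document with no shared state.

    Executor.map returns results in input order, so results[i]
    always corresponds to docs[i].
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyse_document, docs))


documents = [f"document number {i}" for i in range(40)]
results = analyse_all(documents)
print(len(results))  # 40
```

Each run sees only its own document and nothing another run produced, which is what keeps this pattern out of the orchestration problems described earlier.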

Strong isolation requirements. If two parts of the system operate in different trust or data domains and must not share context, separating them into agents with a controlled interface between them is a legitimate boundary, the same way you would separate two services. Here the separation is doing security work, not decomposition work.

Genuinely open-ended problems with no knowable path. If the space of what the system might need to do is truly unbounded, and you cannot enumerate the flows in advance, a more autonomous multi-agent setup can earn its complexity. This is the case the demos are implicitly selling, and it is real. It is also much rarer than the demos imply. Most production agents serve a knowable set of flows, which means they are a state machine wearing an agent costume, and modelling them as one is the win.

It is worth being fair to the strongest evidence for multi-agent, because it is real. Around the same time Cognition argued against multi-agent, Anthropic published the other side: their multi-agent research system beat a single agent by 90.2% on their internal research eval, exactly the breadth-first, parallelisable case above. The detail that matters is their own: token usage explained about 80% of the performance variance. The multi-agent system was winning in large part because it was spending far more compute. Later work testing this directly, holding the thinking-token budget constant, found single agents match or beat multi-agent setups at equal compute. The lesson is not that multi-agent never helps. It is that a lot of multi-agent wins are really compute wins wearing an architecture costume, and you should know which one you are buying.

The test before you commit to multi-agent: can you write down the flows this system needs to handle? If yes, you probably want one agent and explicit control flow. If genuinely no, if the flows are unbounded and unknowable, multi-agent's complexity may be buying you something. Most teams can write down the flows. They just had not tried before the architecture was chosen.

None of this is an argument that multi-agent is wrong. It is an argument that it is a heavier, more nondeterministic architecture than the diagram admits, and that the burden of proof should sit with adding agents, not with keeping one. The question is not "could we decompose this into agents." You almost always could. The question is whether the decomposition buys you more than the determinism, the single trace, and the lower bill you give up to get it.

For most production systems we are asked to build or rescue, the answer is one agent, good tools, explicit control flow, and a trace you can actually read. It is less impressive on a slide. It is considerably more impressive at 2am when something breaks and you can see exactly why.
